1. Objective
	This section introduces the Apache Hive data warehouse framework where participants will learn about:
	1.1 Main Hive features
	1.2 Hive Architecture
	1.3 HiveQL Basics

2. What is Hive
	2.1 Hive is a Hadoop-based data warehouse framework/infrastructure written in Java
	2.2 Works with large datasets stored in Hadoop's HDFS and compatible file systems such as Amazon S3 filesystem
	2.3 Originally developed by Facebook to cope with growing data volumes in their data warehouse that could not be handled with
		traditional technologies
		
3. Hive's Value Proposition
	3.1 Hive shields non-programmers (business analysts, etc.) from programming MapReduce jobs with its own SQL-like query language called HiveQL
	3.2 Internally HiveQL queries are converted into actual MapReduce jobs submitted to Hadoop for execution against files stored on HDFS
	3.3 In YARN, in addition to MapReduce, Hive also supports Apache Tez and Apache Spark execution engines (MR is used by Hive by default)
	3.4 HiveQL is extensible via new data types, user-defined functions and scripts
	3.5 Custom map and reduce modules can be plugged into HiveQL for fine-tuned processing logic
	
	Note:
		In addition to MapReduce, the newer versions of Hive can leverage alternative execution engines, such as Apache Spark and Apache Tez. Switching between execution engines is 
		as simple as setting Hive's hive.execution.engine system property to the target engine string value in your Hive shell session. For example:
		set hive.execution.engine = tez;
		set hive.execution.engine = spark;
		The default value for this configuration is "mr"
	
4. Who uses Hive
	4.1 Facebook
	4.2 Netflix
	4.3 AWS
	
5. Hive's Main Sub-Systems
	5.1 Hive consists of three main sub-systems:
		5.1.1 Metadata store (metastore) for schema information (table column data types, location of related files, etc.), stored in an embedded or external database
		5.1.2 HCatalog, a Hive sub-system that sits on top of the metastore and provides access to it for other Hadoop-centric products, such as Pig and MapReduce jobs
		5.1.3 HCatalog provides a RESTful interface to the metastore through its web server called WebHCat
	5.2 Serialization/deserialization (SerDe) framework for reading and writing data
	5.3 Query processor that translates HiveQL statements into MapReduce instructions	
	
6. Hive Features
	6.1 Support for several file formats: TEXTFILE, SEQUENCEFILE, RCFILE, PARQUET and ORC
	6.2 Bitmap indexes, among other index types
	6.3 Metadata storage in an internal (embedded) or external RDBMS
	6.4 Ability to work on data compressed with gzip, bzip2 and other algorithms
	6.5 Support for User-Defined Functions (UDFs)
	6.6 SQL-like query language (HiveQL)
	6.7 Support for JDBC, ODBC and Thrift clients
	6.8 Hive lends itself to integration with business intelligence and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, etc.
		In many cases, integration is done using Hive's high-performance ODBC driver, which supports all SQL-92 interfaces
	Note:
		RCFILE (Record Columnar File) is a data storage format used to store relational tables on a cluster of computers. The RCFILE format has been developed for data processing with
		MapReduce frameworks.
		The ORC (Optimized Row Columnar) file format provides a highly efficient way to store Hive data.
		ORC has the following advantages over the RCFile format:
			Reduced load on the NameNode
			Support for an extended range of types: datetime, decimal, and the complex types (struct, list, map and union)
			Light-weight indexes stored within the file
			Seeking for a given row in the data set
			Data compression
			Efficient and flexible metadata storage using Google's Protocol Buffers
		Apache Thrift is a cross-language data serialization framework which is used to build RPC clients and servers.	
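	As an illustration of the file-format support listed above, the storage format is declared in the table DDL; the table and columns below are made up for the example:

	```sql
	-- Hypothetical table stored in the ORC columnar format
	CREATE TABLE web_logs (
	    ip   STRING,
	    url  STRING,
	    hits INT
	)
	STORED AS ORC;

	-- The same table could be declared with a different format, e.g.:
	-- STORED AS TEXTFILE;  or  STORED AS PARQUET;
	```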
	
7. The "Classic" Hive Architecture
	CLI -> Driver -> HADOOP
	JDBC/ODBC -> Thrift Server -> Driver -> HADOOP 
	Web GUI -> Driver -> HADOOP

8. The New Hive Architecture
	8.1 Hive architecture was redesigned as a client/server system to enable various remote clients to execute queries against the Hive Server, referred to as HiveServer2
	8.2 The new architecture provides better support for multi-client concurrency and authentication
	8.3 In addition to the original Hive CLI, Hive 2 also supports the new CLI, called Beeline, as well as JDBC and ODBC clients
		Beeline -> HiveServer2 -> Hadoop
		JDBC Client -> HiveServer2 -> Hadoop
		ODBC Client -> HiveServer2 -> Hadoop
		Hive CLI -> HiveServer2 -> Hadoop

9. HiveQL
	9.1 SQL-like query language developed to hide complexities of manually coding MapReduce jobs
	9.2 HiveQL does not fully implement SQL-92 standard
	9.3 It has Hive-specific syntactic extensions
	9.4 Hive is suited for batch processing over large sets of immutable data (e.g. web logs)
		UPDATE and DELETE operations are not supported, but INSERT INTO is
	9.5 HiveQL does not support transactions
		Full ACID transactional support is planned for future releases
	9.6 Hive often works more efficiently on denormalized data
	9.7 Generally, HiveQL statements are case-insensitive
	
	
10. Where are the Hive Tables Located?
	10.1 The location of the tables that you create in Hive is specified in the hive.metastore.warehouse.dir property of the hive-site.xml configuration file:
		<property>
			<name>hive.metastore.warehouse.dir</name>
			<value>/user/hive/warehouse</value>
		</property>
	10.2 The value of the property points to a directory on HDFS
	10.3 For the above value, you can verify the existence of the warehouse directory by running the following command:
		hadoop fs -ls /user/hive/warehouse
	Note:
		Setting up Hive, among other steps, requires the following:
		Have Hadoop in your path, or issue the following command:
		export HADOOP_HOME=<hadoop-install-dir>
		Create the /user/hive/warehouse and /tmp folders on HDFS
		Then, apply the chmod g+w HDFS command to give write permissions to users in the working groups so that they can create tables in Hive.
		
		Commands to perform this setup are:
		$HADOOP_HOME/bin/hadoop fs  -mkdir /tmp
		$HADOOP_HOME/bin/hadoop fs  -mkdir /user/hive/warehouse
		$HADOOP_HOME/bin/hadoop fs  -chmod g+w /tmp
		$HADOOP_HOME/bin/hadoop fs  -chmod g+w /user/hive/warehouse

10a. Add the Hive environment settings to the system path by editing /etc/profile or ~/.bashrc
	10a.1 Enable the settings immediately:
		$source /etc/profile
		
10b. Change temporary data file path
	10b.1 Modify the configuration file at $HIVE_HOME/conf/hive-site.xml
		hive.exec.scratchdir: This is the temporary data file path. By default it is /tmp/hive-${user.name}.

10c. Change the Hive metastore
	By default, Hive uses an embedded Derby database as the metastore. Hive can also use external databases, such as MySQL or PostgreSQL, as the metastore.
	
	To configure Hive to use other databases, the following parameters should be configured:
		javax.jdo.option.ConnectionURL // the database URL
		javax.jdo.option.ConnectionDriverName // the JDBC driver name
		javax.jdo.option.ConnectionUserName // database username
		javax.jdo.option.ConnectionPassword // database password

	The following is an example setting using MySQL as the metastore database:
		<property>
			<name>javax.jdo.option.ConnectionURL</name>
			<value>jdbc:mysql://sandbox.hortonworks.com/hive?createDatabaseIfNotExist=true</value>
			<description>JDBC connect string for a JDBC metastore</description>
		</property>
		<property>
			<name>javax.jdo.option.ConnectionDriverName</name>
			<value>com.mysql.jdbc.Driver</value>
			<description>Driver class name for a JDBC metastore</description>
		</property>	
		<property>
			<name>javax.jdo.option.ConnectionUserName</name>
			<value>hive</value>
			<description>username to use against metastore database</description>
		</property>
		<property>
			<name>javax.jdo.option.ConnectionPassword</name>
			<value>hive</value>
			<description>password to use against metastore database</description>
		</property>

	Make sure the MySQL JDBC driver is available at $HIVE_HOME/lib.

		
11. "Legacy" Hive Command-line Interface (CLI)
	11.1 Hive offers a command-line interface (CLI) through its hive utility that can be used in two modes of operation:
		Interactive mode, where the user enters HiveQL-based queries manually
		Unattended (batch) mode, where the hive utility takes commands as command-line arguments
		
12. The Beeline Command Shell
	12.1 HiveServer2 comes with its own CLI called Beeline
	12.2 Beeline is a JDBC client that is based on the SQLLine Client
	12.3 The Beeline shell supports two operational modes: embedded and remote
		In embedded mode, Beeline runs an embedded Hive (similar to Hive CLI)
		In remote mode, it connects to a separate HiveServer2 process over Thrift
	12.4 You can run Hive CLI commands from Beeline
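	A remote-mode Beeline session might look like the following sketch; the host name and credentials are placeholders, and 10000 is HiveServer2's customary default port:

	```sql
	-- Connect to a remote HiveServer2 over JDBC/Thrift from the Beeline prompt
	!connect jdbc:hive2://hiveserver-host:10000 hiveuser hivepassword

	-- Once connected, regular HiveQL can be issued:
	SHOW TABLES;
	```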

12a. Comparison of Beeline and CLI Command-line Syntax

	Purpose 						HiveServer2 Beeline 								HiveServer1 CLI
	Server connection 				beeline -u <jdbcurl> -n <username> -p <password> 	hive -h <hostname> -p <port>
	Help 							beeline -h or beeline --help 						hive -H
	Run query 						beeline -e <query in quotes>						hive -e <query in quotes>
									beeline -f <query file name>						hive -f <query file name>	
	Define variable					beeline --hivevar key=value							hive --hivevar key=value
	Enter mode						beeline												hive
	Connect							!connect <jdbcurl> 									n/a
	List tables 					!table 												show tables;
	List columns 					!column <table_name> 								desc <table_name>;
	Run query 						<HQL query>; 										<HQL query>;
	Save result set 				!record <file_name>									n/a
	Stop saving result set			!record												n/a
	Run shell CMD					!sh ls												!ls;
	Run dfs CMD 					dfs -ls 											dfs -ls;
	Run file of SQL 				!run <file_name> 									source <file_name>;
	Check Hive version 				!dbinfo 											!hive --version;
	Quit mode 						!quit 												quit;

	For Beeline, ; is not needed after the command that starts with !.
	When running a query in Hive CLI, MapReduce statistics are shown on the console
	screen during processing, whereas Beeline does not show them.
	
	Hive CLI shows the exact line and position of the Hive query or syntax errors when the
	query has multiple lines. However, Beeline processes the multiple-line query as a single
	line, so only the position is shown for query or syntax errors with the line number as 1 for
	all instances. For this aspect, Hive CLI is more convenient than Beeline for debugging the
	Hive query.

12b. The Hive-integrated Development Environment

	Besides the command-line interface, there are a few integrated development
	environment (IDE) tools available for Hive development. One of the best is Oracle SQL
	Developer, which leverages the powerful functionalities of Oracle IDE and is totally free
	to use. If we have to use Oracle along with Hive in a project, it is quite convenient to
	switch between them from within the same IDE.
	Oracle SQL Developer has supported Hive since version 4.0.3. Configuring it to work with
	Hive is quite straightforward. The following are a few steps to configure the IDE to
	connect to Hive:
	
	1. Download Hive JDBC drivers from the vendor website, such as Cloudera.
	2. Unzip the JDBC version 4 driver to a local directory.
	3. Start Oracle SQL Developer and navigate to Preferences | Database | Third Party JDBC Drivers.
	4. Add all of the JAR files contained in the unzipped directory to the Third-party JDBC Driver Path setting
	
	
	5. Click on the OK button and restart Oracle SQL Developer.
	6. Create new connections in the Hive tab giving a proper Connection Name,Username, Password, Host name (Hive server hostname), Port, and Database.
		Then, click on the Add and Connect buttons to connect to Hive.
		
	
	In Oracle SQL Developer, we can run all Hive interactive commands as well as Hive
	queries. We can also leverage the power of Oracle SQL Developer to browse and export
	data into a Hive table from the graphic user interface and wizard.
	
	
	
13. Summary
	13.1 Hive is a data warehouse framework/infrastructure written in Java that runs on top of Hadoop
	13.2 Hive offers users an SQL-like query language called HiveQL which shields users from complexities of programming MapReduce jobs directly
	13.3 Hive also offers a CLI through its hive utility that can be used in two modes of operation (interactive and batch)
	13.4 Beeline, the CLI for Hive 2, provides support for Hive Commands
	13.5 We also looked into a few of the Hive interactive commands and queries in Hive CLI, Beeline, and IDEs. After going through this chapter, we should be able to set up our
		 own Hive environment locally and use Hive from CLI or IDE tools.

	
	
14. Objective
	This section introduces the Hive command-line interface that allows users to execute HiveQL scripts in two modes:
		Interactive mode
		Unattended (batch) mode
	

15. Hive Command-line Interface (CLI)
	15.1 Hive CLI is based on the command-line hive shell utility, which can be used to execute HiveQL commands in either interactive or unattended (batch) mode:
		15.1.1 In interactive mode, the user enters HiveQL-based queries manually, submitting commands sequentially
		15.1.2 In unattended (batch) mode, the hive shell takes command parameters on the command-line
	15.2 Hive CLI also supports variable substitution that helps create dynamic scripts
	

16. The Hive Interactive Shell
	16.1 You start the interactive shell by running the hive command
		To suppress information messages printed while you are in the shell, start the shell with the -S (silent) flag:
		hive -S
	16.2 After successful initialization, you will get the hive> command prompt
	16.3 To end the shell session, enter quit; or exit; at the command prompt
		Don't forget the semicolon (;) at the end of each command
	16.4 Command syntax of the Hive shell was influenced by the MySQL command-line client, e.g.
		SHOW tables; DESCRIBE sample_07;

17. Running Host OS Commands from the Hive Shell
	17.1 The Hive shell supports execution of host OS (operating system) commands
	17.2 OS commands must be prefixed with "!" and terminated with ";"
		e.g. to print the current working directory, issue the following command:
		hive>!pwd;
	
18. Interfacing with HDFS from the Hive Shell
	18.1 Use the dfs command while in the shell to get access to the HDFS API
	18.2 For example,
		18.2.1 To get the listing of files in the user home directory in HDFS, issue this command:
			hive>dfs -ls;
		18.2.2 To remove a file from HDFS, skipping HDFS's Trash bin:
			hive>dfs -rm -skipTrash myDS;
	Note:
		The dfs command returns a reference to the HDFS command-line interface; to get help on dfs commands, run the following command from the host OS terminal (not from the Hive shell):
		hadoop fs -help
		
19. Hive in Unattended Mode
	19.1 In addition to the interactive shell interface, Hive supports invocation of commands in unattended mode
	19.2 Commands to execute on command-line are preceded by the '-e' flag
	19.3 This mode is suitable for executing short commands that can be issued directly against Hive, e.g.
		$hive -e 'SHOW TABLES;'
	19.4 You can submit more than one command; just use the ";" command separator, e.g.:
		$hive -e 'SHOW TABLES;DESCRIBE sample_07;'
	19.5 To suppress information messages, use the -S (silent) flag
	19.6 To get the CLI help, run the hive -H command	

20. The Hive CLI Integration with the OS Shell
	20.1 The Hive CLI can be easily integrated with the underlying OS shell from which the hive utility is launched
		e.g.
		$hive -S -e 'SELECT * FROM sample_07 LIMIT 3;' > /tmp/sample_07.dat
		The output of the above command is redirected into the /tmp/sample_07.dat file on the file system
	20.2 More sophisticated command chains can be built using Unix command pipelines	

21. Executing HiveQL Scripts
	21.1 Hive can take scripts that contain one or more HiveQL commands for execution in non-interactive (a.k.a. unattended or batch) mode, e.g.:
		Use any text editor to add the following statement to myscript.hql: select * from sample_07;
		e.g. vi myscript.hql
		$hive -f myscript.hql
	21.2 The HiveQL scripts may have any extensions; by convention, the .hql extension is used
	21.3 To execute a script file from inside the Hive shell, use the source command
		hive>source /root/myscript.hql;
		
	
22. Comments in Hive Scripts
	22.1 When you execute a file script using the command line, e.g. hive -f myscript.hql, the script may have comments which are denoted as '--' in
		front of the comment line, e.g.
		-- Monthly report data generator
		--------------------------------
	22.2 Comments don't work inside the interactive Hive shell
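	A minimal script illustrating the comment syntax (the file name and query are just examples):

	```sql
	-- monthly_report.hql: Monthly report data generator
	--------------------------------------------------
	-- Run with: hive -f monthly_report.hql
	SELECT * FROM sample_07 LIMIT 10;
	```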

23. Variables and Properties in Hive CLI
	23.1 Using the command-line, you can define custom variables (a.k.a. properties) that you can later use in your Hive scripts using the ${varname} syntax
		Hive replaces references to variables with their values and then submits the query
	23.2 This facility aids in creating dynamic scripts
	
24. Setting Properties in CLI
	24.1 For setting properties when executing commands in unattended mode, you have two functionally equivalent operations:
		--define key1=value1
		--hivevar key1=value1
	24.2 The above assignments add the key1=value1 key-value pair in the hivevar namespace for the duration of the script execution session
	

25. Example of Setting Properties in CLI
	25.1 The command below shows how to set a property (variable) on command-line
		hive -S -e 'DESCRIBE ${tblName};' --define tblName=sample_07
		The above command will execute the command in silent mode (-S)
		We define the variable tblName as having the value sample_07
		The Hive shell will substitute the ${tblName} parameter with its value (sample_07) and then execute the command as 'DESCRIBE sample_07;' which will show the structure of the table sample_07
		


26. Hive Namespaces
	26.1 Hive namespaces help group variables into buckets of functionally similar properties
	26.2 Hive supports four namespaces (access types are: R for read-only and W for Write-enabled)
	
		Namespaces					Access Type				Description
		
		hivevar						R/W						User-defined variables (properties)
		hiveconf					R/W						Hive configuration properties
		system						R/W						Java system properties
		env							R						Shell environment variables (read-only)
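	The namespaces above can be exercised in a Hive shell session, for example (the table name reuses the sample_07 table mentioned earlier):

	```sql
	-- Read a shell environment variable (env is read-only)
	SET env:HOME;

	-- Read a Java system property
	SET system:user.name;

	-- Define a user variable in the hivevar namespace and reference it via substitution
	SET hivevar:tbl=sample_07;
	SELECT * FROM ${hivevar:tbl} LIMIT 5;
	```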

27. Using the SET Command
	27.1 In the Hive shell, you read/write variables using the SET command
	27.2 Except for the hivevar and hiveconf namespaces, you need to prefix the variable with the namespace it belongs to
	27.3 For example:
		27.3.1 To read the OS user's home directory (which is a property of the env namespace)
			hive> SET env:HOME;
		27.3.2 Using the SET command without an argument will print all available variables in all the namespaces: env, hivevar, hiveconf and system
			hive>SET;
			Note:
				To get extensive information on the Hadoop configuration, use the SET -v; command
				

28. Setting Properties in the Shell
	28.1 To set the value of a writable (settable) property in the interactive shell, use the following command:
		SET [namespace:]myVar=myVarValue;
		The above command is also used to create the variable if it does not exist
	28.2 Inside the interactive Hive shell, the hivevar and hiveconf namespaces can be omitted
	28.3 You can verify the new property's value by running the following command:
		SET [namespace:]myVar;
	28.4 You can reset the configuration to the default values by using the reset command
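	A minimal shell session illustrating the commands above (myVar and its value are arbitrary):

	```sql
	SET myVar=myValue;   -- creates the variable if it does not exist, then sets it
	SET myVar;           -- verify: prints the current value
	RESET;               -- restore configuration properties to their defaults
	```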
	

29. Setting Properties for the New Shell Session
	29.1 One of the useful features of the property facility is the ability to set properties before invoking the interactive shell
		$hive --define var1=val1 --define var2=val2
		The above command will set two session-wide properties, var1 and var2, that can be used in the launched Hive interactive shell session
		hive>set var1;
		hive>set var2;

30. Setting Alternative Hive Execution Engines
	30.1 By default, Hive uses the MR engine
	30.2 You can configure to run Hive on alternative execution engines:
		Apache Tez:
		set hive.execution.engine=tez;
		Apache Spark;
		set hive.execution.engine=spark;
	30.3 The default value for this configuration is set as follows:
		set hive.execution.engine = mr;

31. Summary
	31.1 The Hive CLI is built around the Hive shell utility, which allows users to run commands in two modes:
		Interactive mode, where users enter HiveQL-based queries manually inside the shell
		Unattended (batch) mode, where the Hive shell is invoked from the host OS command-line and executes commands passed as arguments
		
32. Objective
	This section provides an extended overview of Hive's Data Definition Language (DDL); participants will learn about:
	32.1 Creating and Dropping Hive Databases
	32.2 Creating and Dropping Tables in Hive
	32.3 Supported Data Type Categories
	32.4 Table Partitioning
	32.5 The EXTERNAL Keyword
	32.6 Hive Views

33. Hive Data Definition Language
	33.1 The following commands are supported by the Hive Data Definition Language (DDL)
		Create/Drop/Alter Database
		Create/Drop/Truncate Table
		Alter Table/Partition/Column
		Create/Drop/Alter View
		Create/Drop Index
		Show
		Describe

34. Creating Databases in Hive
	34.1 When you start working with Hive, you are provided with the default database which is sufficient in most cases
	34.2 To create a new database in Hive, issue the following command:
		CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name [COMMENT database_comment] [LOCATION hdfs_path]
	Note:
		The SCHEMA and DATABASE terms are used interchangeably

35. Using Databases
	35.1 Create the database without checking whether the database already exists:
		CREATE DATABASE myhivedatabase;
	35.2 Create the database and check whether the database already exists:
		CREATE DATABASE IF NOT EXISTS myhivedatabase;
	35.3 Create the database with location, comments, and metadata information:
		CREATE DATABASE IF NOT EXISTS myhivedatabase
		COMMENT 'my hive database'
		LOCATION '/user/root/hivedatabase'
		WITH DBPROPERTIES ('creator'='tester','date'='2016-10-05');
	35.4 To show available databases on the system, issue this command in the Hive shell:
		hive>SHOW DATABASES;
	35.5 The USE command sets the current working database, e.g:
		hive>USE default;
	35.6 To drop a database, issue this command:
		DROP (DATABASE|SCHEMA) [IF EXISTS] database_name
	
		Note that Hive keeps the database and the table in directory mode. In order to remove
		the parent directory, we need to remove the subdirectories first. By default, the
		database cannot be dropped if it is not empty, unless CASCADE is specified. CASCADE
		drops the tables in the database automatically before dropping the database.
		
		DROP DATABASE IF EXISTS myhivedatabase CASCADE;

36. Creating Tables in Hive
	36.1 To create a table, use the following (simplified) table creation statement:
		CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name [(col_name data_type [COMMENT col_comment], ...)]
		[COMMENT table_comment]
		[PARTITIONED BY (col_name data_type [COMMENT col_comment],...)]
		[ROW FORMAT row_format]
		DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char] [MAP KEYS TERMINATED BY char]
		[STORED AS file_format]
		[LOCATION hdfs_path]
		
		Note:
			The CREATE TABLE statement supports inserting string comments for documenting purposes
	36.2 Before creating a new table, you can check the list of existing tables by issuing the following command:
		SHOW TABLES;
	36.3 Hive stores the table data in a subdirectory of the directory defined by the hive.metastore.warehouse.dir property of the hive-site.xml configuration
		file (the default value is /user/hive/warehouse)
	36.4 Check the table:
		DESCRIBE employee;
		DESCRIBE EXTENDED employee;

37. Supported Data Type Categories
	37.1 The following data type categories are supported in Hive DDL:
		Primitive types
		Array type (holds elements of the same data type)
		Map type (holds maps of data as key-value pairs)
		Struct (holds a C-like structure of grouped elements)
		Union (C-like union types)
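	As a sketch of the STRUCT and UNION categories, a table could declare such columns as follows (all names are illustrative; UNIONTYPE support in queries is limited in older Hive versions):

	```sql
	CREATE TABLE type_demo (
	    id       INT,
	    location STRUCT<city:STRING, country:STRING>,    -- C-like structure
	    misc     UNIONTYPE<INT, STRING, ARRAY<STRING>>   -- holds one of the listed types at a time
	);
	```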

38. Common Numeric Types
	38.1 The following numeric types are supported:
		TINYINT				1 byte signed integer
		SMALLINT			2 byte signed integer
		INT					4 byte signed integer
		BIGINT				8 byte signed integer
		BOOLEAN				{true|false}
		FLOAT				4-byte single precision floating point number
		DOUBLE				8-byte double precision floating point number
		DECIMAL				Arbitrary-precision signed decimal number
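	DECIMAL can be declared with explicit precision and scale, for example (table and column names are made up):

	```sql
	CREATE TABLE payments (
	    id     INT,
	    amount DECIMAL(10,2)   -- up to 10 digits in total, 2 after the decimal point
	);
	```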


		
39. String and Date/Time Types
	39.1 STRING
		String literals can be expressed with either single quotes (') or double quotes ("). C-style escaping within the strings (e.g. '\t') is supported
	39.2 VARCHAR 
		Created with a length specifier (between 1 and 65535)
	39.3 CHAR
		Similar to VARCHAR but fixed-length (1 to 255); values shorter than the specified length value are padded with spaces
	39.4 TIMESTAMP
		Traditional Unix timestamp with optional nanosecond precision
		Timestamp text should be in the YYYY-MM-DD HH:MM:SS[.fffffffff] format
	39.5 DATE
		In the YYYY-MM-DD format, without the time of day
	Note:
		TIMESTAMP supported conversion:
			Integer numeric types: Interpreted as UNIX timestamp in seconds
			Floating point numeric types: Interpreted as UNIX timestamp in seconds with decimal precision
			Strings: JDBC compliant java.sql.Timestamp format YYYY-MM-DD HH:MM:SS[.fffffffff]
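	The conversions listed in the note can be sketched as follows (literal values are arbitrary):

	```sql
	-- String in the JDBC-compliant format converted to TIMESTAMP
	SELECT CAST('2016-10-05 12:30:00' AS TIMESTAMP);

	-- UNIX timestamp in seconds converted to a date-time string
	SELECT from_unixtime(1475670600);
	```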
			

40. Miscellaneous Types
	40.1 BOOLEAN
	40.2 BINARY
	
41. Example of the CREATE TABLE Statement
	41.1 Prepare the data as follows:
		Use any text editor to create the file employee.txt and add the following data to it:
		
		Michael|Shanghai,Toronto|Male,30|DB:80|Product:Developer^DLead
		Peter|London|Male,35|Perl:85|Product:Lead,Test:Lead
		Shelley|New York|Female,27|Python:80|Test:Lead,COE:Architect
		Lucy|Vancouver|Female,57|Sales:89,HR:94|Sales:Lead
		
	Note:
	The ^D should be input like the following:
	Press and hold ALT, then type 004
		
	41.1 CREATE TABLE employee(
			name string		COMMENT 'name', 
			work_place ARRAY<string>	COMMENT 'Work Place',
			gender_age STRUCT<gender:string,age:int>	COMMENT 'Gender and Age',
			skills_score MAP<string,int>	COMMENT 'Skill Score',
			depart_title MAP<string,ARRAY<string>>	COMMENT 'Department and Title')
		ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
		COLLECTION ITEMS TERMINATED BY ','
		MAP KEYS TERMINATED BY ':';		
	
	41.1 DESCRIBE employee;

	41.1 LOAD DATA LOCAL INPATH '/root/TrainingOnHDP/dataset/employee.txt' OVERWRITE INTO TABLE employee;	
	
	41.1 SELECT * FROM employee;
	
	41.2 Hive uses Java generics-style syntax (ARRAY<data_type>, MAP<key_type,value_type>) for specifying the type of data held in ARRAYs and MAPs
	
	41.3 CREATE TABLE employee_id (
			name string,
			employee_id int,
			work_place ARRAY<string>,
			sex_age STRUCT<sex:string,age:int>,
			skills_score MAP<string,int>,
			depart_title MAP<STRING,ARRAY<STRING>>)
		ROW FORMAT DELIMITED
		FIELDS TERMINATED BY '|'
		COLLECTION ITEMS TERMINATED BY ','
		MAP KEYS TERMINATED BY ':';

	41.4 LOAD DATA LOCAL INPATH '/root/TrainingOnHDP/dataset/employee_id.txt' OVERWRITE INTO TABLE employee_id;

	41.5 CREATE TABLE IF NOT EXISTS employee_hr(
		name string,
		employee_id int,
		sin_number string,
		start_date date)
		ROW FORMAT DELIMITED
		FIELDS TERMINATED BY '|'
		STORED AS TEXTFILE;

	41.6 LOAD DATA LOCAL INPATH '/root/TrainingOnHDP/dataset/employee_hr.txt' OVERWRITE INTO TABLE employee_hr;	
	
42. Working with Complex Types
	42.1 The COLLECTION ITEMS TERMINATED BY char clause defines the delimiter character, e.g. ','
	42.2 The collection items can be struct members, array elements, and map key-value pairs
	42.3 For example, with a ',' used as the delimiting char:
		gender_age STRUCT<gender:string,age:int> will require that the underlying rows have their elements grouped as follows:
		Male,30
		Female,27
		
		You access struct elements as follows:
		SELECT gender_age.gender, gender_age.age from employee;
		
		Query the whole struct and each struct column in the table:
		SELECT gender_age FROM employee;
		
		work_place ARRAY<string> will require elements of the array to be packed as follows:
		Shanghai,Toronto
		
		To read the first element (Shanghai) of the work_place array in all rows:
		SELECT work_place[0] FROM employee;
	
		Query the whole array and each array column in the table:
		SELECT work_place FROM employee;
		
		SELECT work_place[0] AS col_1, work_place[1] AS col_2, work_place[2] AS col_3 FROM employee;
	
	
43. Working with Complex Types
	43.1 MAP KEYS TERMINATED BY ':' requires you to separate keys from values with a ':'
	43.2 The COLLECTION ITEMS TERMINATED BY char clause controls the key-value pair separator
	43.3 The DELIMITED FIELDS TERMINATED BY char clause controls the field separator
	43.4 The skills_score MAP<string,int> definition would require the input file elements representing the map to be grouped as follows:
			Python:80
			Sales:89,HR:94
			
			In order to read values keyed, for example, by "Python" from the skills_score map:
			
			SELECT skills_score["Python"] FROM employee;
			
			
			Query the whole map and each map column in the table:
			
			SELECT skills_score FROM employee;
			
			SELECT name, skills_score['DB'] AS DB, skills_score['Perl'] AS Perl, skills_score['Python'] AS Python, skills_score['Sales'] as Sales, skills_score['HR'] as HR FROM employee;

43a. Working with Nested Complex Types
	43a.1 Hive supports an ARRAY nested inside a MAP data type
			Query the composite type in the table:
			
			SELECT depart_title FROM employee;
			
			SELECT name, depart_title['Product'] AS Product, depart_title['Test'] AS Test, depart_title['COE'] AS COE, depart_title['Sales'] AS Sales FROM employee;
			
			SELECT name, depart_title['Product'][0] AS product_col0, depart_title['Test'][0] AS test_col0 FROM employee;

43b. Data Type Conversions
	43b.1 Hive supports both implicit type conversion and explicit type conversion
	43b.2 Primitive type conversion from a narrow to a wider type is known as implicit conversion. However, the reverse conversion is not allowed. All the integral numeric types, FLOAT, and
		STRING can be implicitly converted to DOUBLE, and TINYINT, SMALLINT, and INT can all be converted to FLOAT. BOOLEAN types cannot be converted to any other type.
	43b.3 Explicit type conversion uses the CAST function with the CAST(value AS TYPE) syntax.
		For example, CAST('100' AS INT) will convert the string 100 to the integer value 100. If
		the cast fails, such as CAST('INT' AS INT), the function returns NULL. In addition, the
		BINARY type can only cast to STRING, then cast from STRING to other types, if needed.	
		
		SELECT CAST(gender_age.age AS SMALLINT) AS age FROM employee;	
		
			
44. Table Partitioning
	44.1 Table partitioning significantly improves query performance by having Hive store table data in physically isolated directories and files
	44.2 Partitioning is done on partitioning pseudo-column(s) that you specify in the PARTITIONED BY clause
		PARTITIONED BY (make STRING, year SMALLINT);
	44.3 Partitioning columns are usually used in the WHERE clause of your HiveQL queries, e.g.
		SELECT * FROM CARS WHERE make='Ford' AND year=2014;
	44.4 The partitioning columns must not be repeated in the table definition itself
		If they are, you will get this error: "FAILED: Error in semantic analysis: Column repeated in partitioning columns"
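	Putting 44.2-44.4 together, a partitioned table might be declared and queried as follows (the columns are illustrative):

	```sql
	CREATE TABLE CARS (
	    model STRING,
	    price DOUBLE
	)
	PARTITIONED BY (make STRING, year SMALLINT);  -- pseudo-columns, not repeated above

	-- Register a partition explicitly
	ALTER TABLE CARS ADD PARTITION (make='Ford', year=2014);

	-- Partitioning columns can be queried like regular columns
	SELECT model FROM CARS WHERE make='Ford' AND year=2014;
	```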

45. Table Partitioning
	45.1 Table partitioning changes the Hive table directory structure on the data storage
	45.2 The table directory will have new sub-directories that reflect the sequence of columns in the PARTITIONED BY clause as well as discrete values in the partitioning columns, e.g.:
		CREATE TABLE CAR (
		...
		PARTITIONED BY (make STRING);
		clause will create a set of .../CAR/make=<car_make> directories, e.g.
			.../CAR/make=Acura
			.../CAR/make=Ford
			.../CAR/make=Nissan
	Note:
		The specific make=<CAR_MAKE> directory now must have all the data that belongs to the particular CAR_MAKE;
		The make column is not repeated in the table definition
	45.3 Partitioning makes data storage more efficient	

46. Table Partitioning on Multiple Columns
	46.1 Partitioning a table on more than one column will create a hierarchy of sub-directories that follows the sequence of columns in the PARTITIONED BY clause
		CREATE TABLE CAR(
		...)
		PARTITIONED BY (make STRING, year SMALLINT);
		The PARTITIONED BY clause will create a set of
			.../CAR/make=<CAR_MAKE_VALUE>/year=<YEAR>
		sub-directories of the CAR table, e.g.
			.../CAR/make='Acura'/year=2012/
			.../CAR/make='Acura'/year=2013/
			.../CAR/make='Acura'/year=2014/
			.../CAR/make='Ford'/year=2012/
			.../CAR/make='Ford'/year=2013/
			.../CAR/make='Ford'/year=2014/
			.../CAR/make='Nissan'/year=2012/

47. Viewing Table Partitions
	47.1 The following command lists the partitions defined on a table:
		SHOW PARTITIONS <table_name>
	47.2 You can narrow down a list of returned partitions defined on a table, e.g.
		SHOW PARTITIONS car PARTITION (make='Ford');
			make='Ford'/year=2012
			make='Ford'/year=2013
			make='Ford'/year=2014
		

48. Row Format
	48.1 Declared by the [ROW FORMAT row_format] DDL clause
	48.2 If the ROW FORMAT section is not specified, the default ROW FORMAT DELIMITED format is used by Hive
	48.3 The following most common formats are supported:
		48.3.1 DELIMITED FIELDS TERMINATED BY <char>, e.g.
			ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
		48.3.2 SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value,...)]
		

49. Data Serializers/Deserializers
	49.1 Hive supports reading/writing data using specific data serializers/deserializers (The SERDE ... DDL fragment)
	49.2 Reading/Writing of data using Serde occurs as follows
		HDFS files --> Deserializer --> Row Object in Hive
		Row object in Hive --> Serializer --> HDFS files
	49.3 Hive has the following built-in SerDe protocols
		Avro
		ORC
		RegEx
		Thrift
	49.4 External SerDe libraries also exist, e.g. for handling JSON files
	49.5 Users can write custom SerDes for their own data formats
	Note:
		AvroSerde takes care of creating the appropriate Avro schema from the Hive table schema
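
	As a sketch of the SERDE clause, the hypothetical table below uses the built-in RegexSerDe to parse two-column log lines (the table name, column names, and regular expression are illustrative):

```sql
-- Each input line is split by the capture groups of input.regex:
-- group 1 -> host, group 2 -> request
CREATE TABLE web_logs (host STRING, request STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "([^ ]*) (.*)")
STORED AS TEXTFILE;
```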
		

50. File Format Storage
	50.1 Declared with the [STORED AS file_format] DDL clause
	50.2 The following file_format options are supported:
		SEQUENCEFILE is used if data in the file needs to be compressed
		TEXTFILE
		RCFILE
		ORC
		PARQUET
		AVRO
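
	For example, a table stored in one of the columnar formats is declared simply by naming the format (table and column names are illustrative):

```sql
CREATE TABLE events_parquet (id BIGINT, payload STRING)
STORED AS PARQUET;
```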

51. File Compression
	Comparative storage footprint of the same dataset in different file formats:
	Text:		585 GB
	RCFile:		505 GB
	Parquet:	221 GB
	ORCFile:	131 GB
	
	
52. More on File Formats
	52.1 The STORED BY clause is used to support non-native artifacts, e.g. HBase tables
	52.2 STORED AS INPUTFORMAT ... OUTPUTFORMAT allows users to specify the file format managing Java Class names, e.g.
		org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat
		com.hadoop.mapred.DeprecatedLzoTextInputFormat
		org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
	

53. The ORC Data Format
	53.1 The ORC (Optimized Row Columnar) file format provides a highly efficient way to store Hive data
	53.2 ORC has the following advantages:
		Reduced load on the NameNode
		Support for an extended range of types: datetime, decimal and the complex types (struct, list, map and union)
		The ability to seek to a given row in the data set
		(Optional) data compression
		Efficient and flexible metadata storage using Google's Protocol Buffers

54. Converting Text to ORC Data Format
	54.1 Let's say you have a text-based Hive table tblTEXT
	54.2 Create a table that has the same schema as tblTEXT, but has the STORED AS ORC qualifier:
		CREATE TABLE tblORC (...) STORED AS ORC;
	54.3 Insert data into the ORC table from the TEXT table:
		INSERT INTO TABLE tblORC SELECT * FROM tblTEXT;
		The Text-to-ORC conversion will happen automatically
	
	Note:	
		By default, ORC uses the ZLIB compression codec. You can configure compression (or switch it off completely) by using the orc.compress Hive table property, for example:
		STORED AS ORC tblproperties ("orc.compress"="SNAPPY")
		will set the compression algorithm to SNAPPY; using NONE for this configuration value will disable compression
		
56. Example of Creating a Managed Table
	56.1 Create the internal table and load the data
		CREATE TABLE IF NOT EXISTS employee_internal(
			name string,
			work_place ARRAY<string>,
			gender_age STRUCT<gender:string,age:int>,
			skills_score MAP<string,int>,
			depart_title MAP<STRING,ARRAY<STRING>>
		)
		COMMENT 'This is an internal table'
		ROW FORMAT DELIMITED
		FIELDS TERMINATED BY '|'
		COLLECTION ITEMS TERMINATED BY ','
		MAP KEYS TERMINATED BY ':'
		STORED AS TEXTFILE;
		
		LOAD DATA LOCAL INPATH '/root/TrainingOnHDP/dataset/employee.txt' OVERWRITE INTO TABLE employee_internal;
	
		
55. The EXTERNAL DDL Parameter
	55.1 The EXTERNAL parameter allows users to create a table in a location different from the one specified by the hive.metastore.warehouse.dir property in the hive-site.xml configuration file
	55.2 The directory for the table is specified by the LOCATION DDL parameter of the CREATE EXTERNAL TABLE statement
	55.3 This option gives the advantage of re-using data already generated in that location, provided its structure corresponds to the field data types declared by the CREATE statement
		Note:
		When you drop a table created with EXTERNAL parameter, data in the table is not deleted in HDFS (since Hive does not own the data)
	55.4 Tables created without the EXTERNAL parameter are referred to as Hive-managed tables
	
56. Example of Using EXTERNAL
	56.1 If you have an HDFS folder named /user/me/myexternal_folder/ which contains a text file with two tab-delimited columns, you can use the following CREATE EXTERNAL TABLE statement against this
		location:
		
		CREATE EXTERNAL TABLE tblExternal (col1 STRING, col2 STRING)
		COMMENT 'Creating a table at a specific location'
		ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
		STORED AS TEXTFILE 
		LOCATION '/user/me/myexternal_folder/';
		
		
57. Creating an Empty Table
	57.1 In some cases, you may need to have a table with the structure similar to that of an already existing table
	57.2 Use the following command to copy the table definition:
		CREATE TABLE T2 LIKE T1;
		
		In this case, the T2 table will be created with the same table definition used in creating the T1 table
		No records will be copied and T2 will be empty

58. Dropping a Table
	58.1 Use the following command to drop a table
		DROP TABLE [IF EXISTS] table_name
	
	58.2 The DROP TABLE statement removes table's metadata as well as the data in the table
	
	Note:
		The data is moved to the .Trash/Current directory in the user's home HDFS directory (if trashing is configured); the table metadata in the metastore can no longer be recovered
		When dropping a table created with the EXTERNAL parameter, the original data files will not be deleted from the file system
		

59. Table/Partitions Truncation
	59.1 Table/Partitions truncation removes all rows from a table or partitions
	59.2 The command has the following syntax:
		TRUNCATE TABLE table_name [PARTITION partition_spec];
	59.3 Users can specify a partial partition_spec for truncating specific partitions
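
	Against the partitioned CAR table from the earlier sections, truncation might look like this:

```sql
TRUNCATE TABLE car;                          -- remove all rows from the table
TRUNCATE TABLE car PARTITION (make='Ford');  -- remove rows from one partition only
```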


60. Alter Table/Partition/Column
	60.1 These commands allow users to have fine-grained control over the table structure, e.g.
		Changing table name:
			ALTER TABLE table_name RENAME TO new_table_name;
		Adding a partition:
			ALTER TABLE table_name ADD PARTITION [partCol = 'pc_name']
			location 'path_to_data_location';
		Dropping a partition:
			ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec, PARTITION partition_spec, ...
			
		Changing a column:
			ALTER TABLE table_name CHANGE orig_col_name new_col_name col_type
	60.2 This command will allow users to change a column's name and its data type

61. Views
	61.1 Hive views are virtual tables with no associated physical storage
	61.2 Views are built as a logical construct on top of physical tables
	61.3 You can run regular queries against views:
		SELECT name FROM myView WHERE phone LIKE '416%'
	61.4 A view's schema is immutable and is defined when the view is created; subsequent changes to underlying tables (e.g. renaming or adding a column) will not be propagated to the view's schema.
		Any subsequent reference in the view to changed or removed underlying columns will raise an exception

62. Create View Statement
	62.1 A view is created with the following statement:
		CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...)] 
		[COMMENT view_comment]
		AS SELECT ...
	62.2 For example:
		CREATE VIEW vNames (firstName, lastName) AS SELECT fn, ln FROM User;
	62.3 Note:
		If no column names are supplied (they are optional) in the view definition, the names of the view's columns will be derived automatically from the defining AS SELECT expression.
		Column names may be redefined in the view definition

		
63. Why Use Views?
	63.1 Views are useful when you need:
		To restrict viewable data based on some conditions (limiting columns and rows for security and other considerations)
		To wrap up complex queries
		

64. Restricting Amount of Viewable Data
	64.1 Views help restrict the amount of viewable data by the following techniques:
		Providing a view of the subset of columns in the source table
		Providing a subset of rows matching the WHERE clause
		Using a combination of the above techniques
		Using these techniques, you can hide sensitive information by not declaring it in the view

65. Examples of Restricting Amount of Viewable Data
	65.1 You can restrict the viewing of the client income (which may be regarded as confidential information) in the client table defined as follows:
			CREATE TABLE client (fn STRING, ln STRING, income FLOAT)...;
		
		by omitting the income column in the view based on the client table:
			CREATE VIEW publicClientView AS SELECT fn, ln FROM client;
	65.2 Views can also be used to limit the number of returned records by specifying the WHERE matching condition in the AS SELECT clause:
		CREATE VIEW shortView AS
		SELECT * FROM bigTable WHERE <limiting conditions>;
		
		
66. Creating and Dropping Indexes
	66.1 Hive supports column indexing to speed up data querying
		Hive indexing capabilities are limited
	66.2 Indexes are created with the CREATE INDEX... statement
	66.3 The indexed data for a table is stored in another table defined in the CREATE INDEX... statement
	66.4 You can drop a created index by using the DROP INDEX...statement
		DROP INDEX drops the index and deletes the index table
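
	A minimal index life cycle on the CAR table might look as follows (idx_make and car_make_index are illustrative names; COMPACT is one of the built-in index handlers). Note that indexing was removed entirely in Hive 3.0:

```sql
-- Create a deferred index; the indexed data goes into car_make_index
CREATE INDEX idx_make ON TABLE car (make)
AS 'COMPACT' WITH DEFERRED REBUILD
IN TABLE car_make_index;

ALTER INDEX idx_make ON car REBUILD;   -- populate the index table

DROP INDEX IF EXISTS idx_make ON car;  -- drops the index and deletes car_make_index
```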
		

67. Describing Data
	67.1 Hive offers the DESCRIBE statement that can be applied to a number of objects: databases, tables, partitions, views, and columns
	67.2 The DESCRIBE statement shows metadata associated with the target object
	67.3 The general syntax of the statement:
		DESCRIBE some_object;
		
		For example:
		DESCRIBE myTable;
		


68. Summary
	68.1 Hive offers an extensive Data Definition Language (DDL) that addresses the most practical user needs, such as:
		Creating and dropping Hive databases
		Creating and dropping tables
		Performing table partitioning for faster data querying
		Creating views to minimize the amount of viewable data for performance or security considerations

69. Objective
	In this section, participants will learn about Hive's Data Manipulation Language (DML) and its two primary ways of data loading:
		Using the LOAD DATA statement
		Using the INSERT statement

70. Hive Data Manipulation Language (DML)
	70.1 The Hive Data Manipulation Language (DML) deals with loading data in a table or partition
	70.2 There are three primary ways to load data in Hive:
		Using the LOAD statement
		Using the INSERT statement
		Using the INSERT...VALUES statement (supported starting in Hive 0.14)

71. Using the LOAD DATA statement
	71.1 The LOAD DATA statement performs the bulk load operation and has the following syntax:
		LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1,partcol2=val2 ...)]
	71.2 If the keyword LOCAL is specified, then the contents of the file specified by the filepath parameter is loaded (copied over from the OS file system)
		into the target file system
		The filepath parameter can be a relative or an absolute path to the file
	71.3 If the keyword LOCAL is not specified, then Hive resolves the filepath parameter against the default file system (typically HDFS) and moves the file into Hive
	71.4 If the OVERWRITE keyword is present, then the contents of the target table (or partition) will be overwritten by the file referred to by filepath
	71.5 If the OVERWRITE keyword is not present, the source file's content is added to the target table
	

72. Example of Loading Data into a Hive Table
	LOAD DATA LOCAL INPATH '/root/TrainingOnHDP/dataset/employee.txt' OVERWRITE INTO TABLE employee;
	

73. Loading Data with the INSERT Statement
	73.1 The INSERT statement allows inserting data from SELECT-based queries
	73.2 The INSERT statement has two variants:
		INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2...)]
		select_statements1 FROM from_statement;
		
		INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2...)]
		select_statements1 FROM from_statement;
		
74. Appending and Replacing Data with the INSERT Statement
	74.1 The INSERT OVERWRITE variant will overwrite any existing data in the table or partition
	74.2 The INSERT INTO variant will append to the table or partition
	74.3 Data inserts with the INSERT statement can be done to a table or a partition

75. Examples of Using the INSERT Statement
	INSERT INTO TABLE Q1 SELECT sales FROM JanSales;
	INSERT INTO TABLE Q1 SELECT sales FROM FebSales;
	INSERT INTO TABLE Q1 SELECT sales FROM MarSales;
	
	The above three INSERT statements will append data from three different tables to the Q1 table
	
76. Multi Table Inserts
	76.1 Multi Table Inserts is an optimization technique that helps minimize the number of data scans on the source table
		The source data is scanned only once and the scan's result set is then re-used as input for multiple INSERT statements
		

77. Multi Table Inserts Syntax
	77.1 Multiple Table Inserts work for both data appending and data overwriting
		FROM from_statement
		INSERT OVERWRITE TABLE tablename1 [PARTITION(partcol1=val1, partcol2=val2 ...)] select_statement1
		[INSERT OVERWRITE TABLE tablename2 [PARTITION(partcol1=val1, partcol2=val2 ...)] select_statement2] ...;
		
		Note:
		The Multi Table statement starts with the FROM keyword and ends with the ";"
	77.2 The FROM from_statement clause performs a full scan of the source data
		The from_statement can be the name of a Hive table or a JOIN statement
		The scanned result returned by the FROM from_statement clause can now be re-used for multiple table inserts


78. Multi Table Inserts Example
	FROM sourceTable s
	INSERT OVERWRITE TABLE destTable1 SELECT s.col1 WHERE s.col1 != 999
	INSERT OVERWRITE TABLE destTable2 SELECT s.col1 WHERE s.col2 < 0
	
	Note:
	There is no FROM clause in the INSERT statements that follow the FROM clause
	

79. Summary
	79.1 Hive supports data loading into tables and partitions using two statements:
		LOAD DATA
		INSERT

80. Objectives
	In this section, participants will learn about:
		Hive Query Language (HiveQL)
		SELECT and related statements
		HiveQL built-in functions

81. HiveQL
	81.1 For data query, Hive uses an SQL-like language called HiveQL
	81.2 While many of the language constructs resemble SQL-92, HiveQL does not claim full standard compliance
	81.3 HiveQL offers Hive-specific extensions that help leverage Hive internal architecture
	81.4 Hive offers only basic support for indexes, so, in most cases, a full table scan is performed when a query is run against a table
	
82. The SELECT Statement Syntax
	82.1 The center-piece of HiveQL is the SELECT statement which has the following syntax:
	
		SELECT [ALL|DISTINCT] select_expr1, select_expr2,...
		FROM table_reference
		[WHERE where_condition]
		[GROUP BY col_list]
		[HAVING having_condition]
		[LIMIT number]
		
		

83. The WHERE Clause
	83.1 Hive closely models the SQL's WHERE clause syntax
	83.2 As of Hive 0.13, some types of subqueries are supported in the WHERE clause
	83.3 The WHERE clause supports a number of Hive operator and user-defined functions that evaluate to a boolean result

84. Examples of the WHERE Statement
	84.1 For example:
	
	SELECT ... FROM ... WHERE X <= Y AND X != Z;
	SELECT ... FROM ... WHERE round(X) = Y;
	
	Using Hive's round() built-in function
	SELECT ... FROM ... WHERE X IS NULL;
	SELECT ... FROM ... WHERE X IS NOT NULL;
	SELECT ... FROM ... WHERE X LIKE 'A%';
	
	Finding all rows where values in the X column start with 'A'
	SELECT ... FROM ... WHERE X LIKE '%Z'
	
	Finding all rows where values in the X column end with 'Z'
	
	Notes:
	
	Through its WHERE extension, the RLIKE clause, Hive also allows users to use more powerful Java regular expressions for finding matches
	
	For example, SELECT ... FROM ... WHERE name RLIKE '.*(Bill|William).*'
	
	

85. Partition-based Queries
	85.1 Generally, HiveQL's SELECT query performs a full table scan (a highly inefficient process) when finding the matching rows
	85.2 Hive offers a query optimization technique based on table partitioning
	85.3 Partitions are column-based and are created using the PARTITIONED BY clause of the CREATE TABLE statement
	85.4 SELECT statements that take advantage of existing table partitions help reduce the amount of data to be scanned
	
86. Example of an Efficient SELECT Statement
	86.1 Provided that the stats table is partitioned by the date field, the following
		SELECT statement will speed up queries by limiting the amount of data to be scanned (it is assumed that there are data partitioning directories named
		2014-01-01, 2014-01-02,...,2014-02-01, 2014-02-02, ...2014-12-31):
		
		SELECT * FROM stats WHERE date > '2014-05-01' AND date <= '2014-09-01'

		
87. The DISTINCT Clause
	87.1 The DISTINCT clause removes duplicate rows in the output
		The default is to return all the matching records (the ALL clause)
		SELECT f1, f2 FROM Features
		101	303
		101	303
		101	404
		202	505
		Note:
		The default ALL clause is applied
		
		SELECT DISTINCT f1, f2 FROM Features
		101, 303
		101, 404
		202, 505
		
		Note:
		One 101, 303 row is dropped as the row duplication
		filter is applied across all columns
		
		SELECT DISTINCT f1 FROM Features
		101
		202
		

88. Supported Numeric Operators
	88.1 HiveQL supports the standard set of arithmetic operations on numeric types in the SELECT statement
	
		Operators					Description
		A+B							Addition of A and B
		A-B							Subtraction of B from A
		A*B							Multiplication of A and B
		A/B							Division of A by B
		A%B							The remainder of dividing A by B
		A&B							Bitwise AND of A and B
		A|B							Bitwise OR of A and B
		A^B							Bitwise XOR of A and B
		~A							Bitwise NOT of A

	88.2 Arithmetic operators take any numeric type
	88.3 No type coercion (casting) is performed if both operands are of the same numeric type
		Otherwise, the value of the smaller of the two types is promoted to the wider type (with more allocated bytes) of the other operand
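
	For example (run against the employee table used throughout this section):

```sql
-- 3 (INT) is promoted to BIGINT to match the 2L literal (result: 5);
-- division in Hive always produces a DOUBLE (7 / 2 = 3.5),
-- while % keeps the integral type (7 % 2 = 1)
SELECT 3 + 2L, 7 / 2, 7 % 2 FROM employee LIMIT 1;
```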
		
89. Built-in Mathematical Functions
	89.1 HiveQL supports the usual set of mathematical functions in its SELECT statement	
		round(), floor(), ceil(), exp(), log(), sqrt(), sin(), cos() etc.
	89.2 Most mathematical functions return a DOUBLE, or NULL if a NULL parameter is passed to the function
	


90. Built-in Aggregate Functions
	AVG(col)				Return the average of the values in column col
	AVG(DISTINCT col)		Return the average of the distinct values in column col
	COUNT(*)				Return the total number of retrieved rows, including rows containing NULL values
	SUM(col)				Return the sum of the values in column col
	SUM(DISTINCT col)		Return the sum of the distinct values in column col
	MAX(col)				Return the maximum value in column col
	MIN(col)				Return the minimum value in column col
	
	Note:
	All functions return their results as a DOUBLE except for COUNT() which returns a BIGINT
	

91. Built-in Statistical Functions
	Function					Description
	CORR(col1,col2)				Return the correlation of two sets of numbers
	COVAR_POP(col1,col2)		Return the covariance of a set of numbers
	STDDEV_POP(col)				Return the standard deviation of a set of numbers
	
	Note:
	All statistical functions return result as a DOUBLE
	For finding the mean of a population, use the AVG() function
	
	
92. Other Useful Built-in Functions
	Function				Return Type					Description
	INSTR(str,substr)		INT 						Return the position of the first occurrence of substr in str
	LENGTH(s)				INT 						Return the length of the string
	REPEAT(str,n)			STRING 						Repeat str n times
	SPACE(n)				STRING 						Return n spaces
	YEAR(timestamp)			INT 						Return the year part as an INT of a timestamp string, e.g., year("2016-10-03 00:00:00") returns 2016
	MONTH(timestamp)		INT 						Return the month part as an INT of a timestamp string, e.g., month("2016-10-03 00:00:00") returns 10
	DAY(timestamp)			INT 						Return the day part as an INT of a timestamp string, e.g., day("2016-10-03 00:00:00") returns 3
	TO_DATE(timestamp)		STRING  					Return the date part of a timestamp string, e.g. to_date("2016-10-03 00:00:00") returns "2016-10-03"
	col IN (val1, val2, ...)	BOOLEAN 				Return true if col equals one of the values in the list (val1, val2, ...), false otherwise
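
	A few of these functions in action (run against the employee table used in this section):

```sql
SELECT INSTR('Hadoop', 'do'),          -- 3
       LENGTH('Hive'),                 -- 4
       REPEAT('ab', 3),                -- 'ababab'
       MONTH('2016-10-03 00:00:00'),   -- 10
       TO_DATE('2016-10-03 00:00:00')  -- '2016-10-03'
FROM employee LIMIT 1;
```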
	
	
93. The GROUP BY Clause
	93.1 The GROUP BY statement is normally used in conjunction with the aggregate functions to group the result-set by one or more columns
	93.2 The SELECT statement performs an aggregation over each group
		SELECT sales_month, SUM(sales) FROM sales_2014
		WHERE city = 'Toronto' AND price >= '1000'
		GROUP BY sales_month;
	

94. The HAVING Clause
	94.1 The HAVING clause was added to HiveQL in version 0.7 because the WHERE keyword cannot be used with aggregate functions
	94.2 The following SELECT statement will list all months' total sales in Toronto where the total sales per month were in excess of 1 million dollars
	
		SELECT sales_month, SUM(sales) FROM sales
		WHERE city = 'Toronto'
		GROUP BY sales_month
		HAVING SUM (sales) > 1000000;
		
95. The LIMIT Clause
	95.1 The LIMIT clause sets the number of rows to be returned
		Note:
			The returned rows are not chosen in any particular order; repeating this command may yield different results (different rows may be returned)
			SELECT * FROM myTable LIMIT 3;
			

96. The ORDER BY Clause
	96.1 The ORDER BY syntax in HiveQL is similar to that of ORDER BY in SQL language
	96.2 The ORDER BY supports ascending and descending ordering represented by the ASC (default) and DESC keywords respectively, e.g.
		SELECT * FROM myTable ORDER BY col1 DESC;
		

97. The JOIN Clause
	97.1 HiveQL supports the SQL-like JOIN clause which is used to combine rows from two or more tables on values from a common column
	97.2 Only equality joins are allowed, i.e. joins using the equals condition (t1.col1 = t2.col2)
	97.3 Examples:
		SELECT A.* FROM A JOIN B ON (A.col1 = B.col2)
	97.4 More than two tables can be joined in the same query, e.g.:
		SELECT A.col1, B.col2, C.col3
		FROM A JOIN B ON (A.id1 = B.id2)
		JOIN C ON (C.id3 = B.id4)
		
98. The CASE ... Clause
	98.1 HiveQL supports the if-like CASE ... WHEN ... ELSE combined statement which has the following syntax:
		SELECT col1, ...,
		CASE WHEN <condition A> THEN <label 1>
		[WHEN <condition B> THEN <label 2>, WHEN ...]
		ELSE <label N>
		END AS <pseudo_column_name>
		FROM table_name;
		

99. Example of CASE ... Clause
	SELECT location, datetime, 
	CASE
	WHEN speedMPH > 40 AND speedMPH < 73 THEN 'Light'
	WHEN speedMPH >= 73 AND speedMPH < 113 THEN 'Moderate'
	WHEN speedMPH >= 113 AND speedMPH < 158 THEN 'Considerable'
	WHEN speedMPH >= 158 THEN 'Severe'
	ELSE 'Unknown' END AS category FROM Tornadoes;
	
	


100. Summary
	100.1 HiveQL offers SQL-like SELECT statement with the related WHERE, GROUP BY and HAVING clauses
	100.2 HiveQL supports the common set of
		Numeric Operators (+/-,%,etc.)
		Built-in Aggregate Functions (AVG(),SUM(),COUNT() etc.)
		Assorted functions (INSTR(),LENGTH(),YEAR() etc.)
	100.3 In addition, HiveQL has a number of built-in statistical functions (CORR(), STDDEV_POP() etc)
	100.4 The CASE ... WHEN ... THEN ... ELSE combined statement is also supported



	
Some tips:

1. Order and Sort

	ORDER BY (ASC|DESC): 
	
	This is similar to the RDBMS ORDER BY statement. A sorted order is maintained across all of the output from every reducer. It performs the global
	sort using only one reducer, so it takes a longer time to return the result. Usage of LIMIT is strongly recommended with ORDER BY. When hive.mapred.mode = strict
	is set (the default is hive.mapred.mode = nonstrict) and LIMIT is not specified, an exception is raised. This can be used as follows:
		SELECT name FROM employee ORDER BY NAME DESC;
		
	SORT BY (ASC|DESC): 
	
	This indicates which columns to sort when ordering the reducer input records. This means it completes sorting before sending data to the
	reducer. The SORT BY statement does not perform a global sort and only makes sure data is locally sorted in each reducer unless we set mapred.reduce.tasks=1. In this
	case, it is equal to the result of ORDER BY. It can be used as follows:
	
	--Use more than 1 reducer
	SET mapred.reduce.tasks = 2;
	SELECT name FROM employee SORT BY NAME DESC;	
	
	DISTRIBUTE BY: 
	
	Rows with matching column values will be partitioned to the same
	reducer. When used alone, it does not guarantee sorted input to the reducer. The
	DISTRIBUTE BY statement is similar to GROUP BY in RDBMS in terms of deciding
	which reducer to distribute the mapper output to. When using with SORT BY,
	DISTRIBUTE BY must be specified before the SORT BY statement. And, the column
	used to distribute must appear in the select column list. It can be used as follows:

		SELECT name, employee_id FROM employee_hr DISTRIBUTE BY employee_id;
		
		SELECT name, employee_id FROM employee_hr DISTRIBUTE BY employee_id SORT BY name;	
		
	CLUSTER BY: 
	
	This is a shorthand operator to perform DISTRIBUTE BY and SORT BY
	operations on the same group of columns. And, it is sorted locally in each reducer.
	The CLUSTER BY statement does not support ASC or DESC yet. Compared to ORDER BY,
	which is globally sorted, the CLUSTER BY operation is sorted in each distributed
	group. To fully utilize all the available reducers when doing a global sort, we can do
	CLUSTER BY first and then ORDER BY. This can be used as follows:
		
		SELECT name, employee_id FROM employee_hr CLUSTER BY name;	

		
		
1. Does Hive have something equivalent to DUAL?
		No.
		Workaround:
		We can use an existing table to achieve DUAL functionality with the following query
		
			select 1, 'name', array('str1','str2'), map('key',array(1,2)) from employee limit 1;


2. How to frame the contents of my local file to support HIVE nested ARRAY in MAP data type

	Hive's default delimiters are:

	Row Delimiter => Control-A ('\001')
	Collection Item Delimiter => Control-B ('\002')
	Map Key Delimiter => Control-C ('\003')
	
	If you override these delimiters, then the overridden delimiters are used during parsing. The preceding description of delimiters is correct for the usual case of flat data structures,
	where the complex types only contain primitive types. For nested types, the level of nesting determines the delimiter.

	Using an ARRAY of ARRAYs as an example, the delimiters for the outer ARRAY are Ctrl-B ('\002') characters, as expected, but for the inner ARRAY they are Ctrl-C ('\003') characters, the next delimiter in the list.
	For our example of using a MAP of ARRAYs (the depart_title column in the preceding tables), the MAP key delimiter is '\003', and the ARRAY delimiter is Ctrl-D or ^D ('\004').

	Hive actually supports eight levels of delimiters, corresponding to ASCII codes 1, 2, ... 8, but you can only override the first three.


	So you can write your input file as following format:

	1|JOHN|abu1/abu2|key1:1'\004'2'\004'3/key2:6'\004'7'\004'8
	
	Output of SELECT * FROM test_stg; will be:

	1       JOHN     ["abu1","abu2"]     {"key1":[1,2,3],"key2":[6,7,8]}

	Quick workaround:

	1. create table test_table as select 1, 'name', array('str1','str2'), map('key',array(1,2)) from employee limit 1;

	$ hdfs dfs -copyToLocal /apps/hive/warehouse/test_table/000000_0 test_table

	$ vi test_table

	1^Aname^Astr1^Bstr2^Akey^C1^D2



CREATE TABLE employee(
			name string		COMMENT 'name', 
			work_place ARRAY<string>	COMMENT 'Work Place',
			gender_age STRUCT<gender:string,age:int>	COMMENT 'Gender and Age',
			skills_score MAP<string,int>	COMMENT 'Skill Score',
			depart_title MAP<string,ARRAY<string>>	COMMENT 'Department and Title')
		ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
		COLLECTION ITEMS TERMINATED BY ','
		MAP KEYS TERMINATED BY ':';		
		
LOAD DATA INPATH '/user/root/exmployee1.txt' OVERWRITE INTO TABLE employee;	

exmployee1.txt		


Advanced Hive Features

1. Managed Table

CREATE TABLE employee_test(
			name string,
			age int
			);

drop table employee_test;
			
CREATE TABLE employee_test_partition(
			name string,
			age int)
partitioned by (edate string);

insert into table employee_test_partition partition (edate) values("peter", 30, "20161001") ;
insert into table employee_test_partition partition (edate) values("john", 50, "20161003") ;

ALTER TABLE employee_test_partition PARTITION(edate="20161001") RENAME TO PARTITION(edate="20161011");
ALTER TABLE employee_test_partition DROP IF EXISTS PARTITION(edate="20161011");
LOAD DATA INPATH '/user/root/employee_test_partition.txt' INTO TABLE employee_test_partition PARTITION (edate="20161101");	
ALTER TABLE employee_test_partition ADD PARTITION (edate="20171101") location "/user/root/20171101"; (the files are not moved into /apps/hive/warehouse even for a managed table; changes to the files are immediately reflected in the table)



hadoop fs -mv /apps/hive/warehouse/employee_test_partition/edate=20161001 /apps/hive/warehouse/employee_test_partition/emdate=20161001
hadoop fs -mv /apps/hive/warehouse/employee_test_partition/edate=20161003 /apps/hive/warehouse/employee_test_partition/emdate=20161003
ALTER TABLE employee_test_partition DROP IF EXISTS PARTITION(edate="20161001");
ALTER TABLE employee_test_partition DROP IF EXISTS PARTITION(edate="20161003");
update hive.PARTITION_KEYS set PKEY_NAME = "emdate" where TBL_ID = 18;
ALTER TABLE employee_test_partition ADD PARTITION (emdate="20161001") location "/apps/hive/warehouse/employee_test_partition/emdate=20161001";
ALTER TABLE employee_test_partition ADD PARTITION (emdate="20161003") location "/apps/hive/warehouse/employee_test_partition/emdate=20161003";


CREATE TABLE employee_test(
			name string,
			age int
			)
location "/user/root/20171101"; (the files are not moved into /apps/hive/warehouse even for a managed table; changes to the files are immediately reflected in the table)


2. Dynamic Partition

SET hive.exec.dynamic.partition.mode;
SET hive.exec.dynamic.partition=true;

Dynamic partition insert could potentially be a resource hog in that it could generate a large number of partitions in a short time. To guard against this, Hive defines three parameters:

hive.exec.max.dynamic.partitions.pernode (default value being 2000) is the maximum number of dynamic partitions that can be created by each mapper or reducer. If one mapper or reducer exceeds this threshold, a fatal error is raised from that mapper/reducer (through a counter) and the whole job is killed.
hive.exec.max.dynamic.partitions (default value being 5000) is the total number of dynamic partitions that can be created by one DML statement. If no single mapper/reducer exceeds its limit but the total number of dynamic partitions does, an exception is raised at the end of the job, before the intermediate data is moved to the final destination.
hive.exec.max.created.files (default value being 5000) is the maximum total number of files created by all mappers and reducers. This is implemented by having each mapper/reducer update a Hadoop counter whenever a new file is created. If the total exceeds hive.exec.max.created.files, a fatal error is thrown and the job is killed.
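A dynamic-partition session is typically prepared by setting these safety limits explicitly; the values below are illustrative, not recommendations:

```sql
-- Session-level safety limits for a large dynamic-partition insert
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=1000;
SET hive.exec.max.dynamic.partitions=10000;
SET hive.exec.max.created.files=150000;
```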



3. External Table

CREATE EXTERNAL TABLE employee_external_test(
			name string,
			age int
)
partitioned by (edate string);			

insert into table employee_external_test partition (edate) values("peter", 30, "20161001"); (the data is written under /apps/hive/warehouse because this external table does not define a location)
insert into table employee_external_test partition (edate) select name, age, emdate from employee_test_partition;	
insert into table employee_external_test partition (edate) select * from employee_test_partition;	

UPDATE tablename SET column = value [, column = value…] [WHERE expression]
DELETE FROM tablename [WHERE expression]
(UPDATE and DELETE require transactional (ACID) tables; see the Hive Transaction section)


hadoop fs -put '/root/TrainingOnHDP/dataset/employee_test_partition.txt' /user/root/employee_test_partition.txt
LOAD DATA INPATH '/user/root/employee_test_partition.txt' INTO TABLE employee_external_test PARTITION (edate="20181101"); (LOAD DATA INPATH moves the source files away even if it is an external table)

hadoop fs -mkdir /apps/hive/warehouse/employee_external_test/edate=20191001 
hadoop fs -put '/root/TrainingOnHDP/dataset/employee_test_partition.txt' /apps/hive/warehouse/employee_external_test/edate=20191001
MSCK REPAIR TABLE employee_external_test;
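MSCK REPAIR TABLE works here because the directory created above follows the edate=<value> naming convention; the same partition could also be registered explicitly:

```sql
-- Manual equivalent of the MSCK REPAIR above, one partition at a time
ALTER TABLE employee_external_test ADD IF NOT EXISTS
  PARTITION (edate="20191001")
  LOCATION '/apps/hive/warehouse/employee_external_test/edate=20191001';
```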


hadoop fs -mkdir /user/root/data 
hadoop fs -mkdir /user/root/data/edate=20160101 
hadoop fs -put '/root/TrainingOnHDP/dataset/employee_test_partition.txt' /user/root/data/edate=20160101/employee_test_partition.txt


hadoop fs -mkdir /user/root/data/edate=20170101 
hadoop fs -put '/root/TrainingOnHDP/dataset/employee_test.txt' /user/root/data/edate=20170101



CREATE EXTERNAL TABLE employee_external_test1(
			name string,
			age int
)
partitioned by (edate string)
location "/user/root/data";

MSCK REPAIR TABLE employee_external_test1;

insert into table employee_external_test1 partition (edate) values("peter", 30, "20160109"); 

hadoop fs -put '/root/TrainingOnHDP/dataset/employee_test.txt' /user/root/data/edate=20160101

4. Aggregation and Sampling

4.1 Prepare table and data for demonstration

CREATE TABLE IF NOT EXISTS employee_contract(
name string,
dept_num int,
employee_id int,
salary int,
type string,
start_date date
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH
'/root/TrainingOnHDP/dataset/employee_contract.txt' 
OVERWRITE INTO TABLE employee_contract;

4.2 Regular aggregate functions used as analytic functions

SELECT name, dept_num, salary,
COUNT(*) OVER (PARTITION BY dept_num) AS row_cnt,
SUM(salary) OVER(PARTITION BY dept_num ORDER BY dept_num) AS deptTotal,
SUM(salary) OVER(ORDER BY dept_num) AS runningTotal1,
SUM(salary) OVER(ORDER BY dept_num, name rows unbounded 
preceding) AS runningTotal2
FROM employee_contract
ORDER BY dept_num, name;

4.3 Other analytic functions

SELECT name, dept_num, salary,
RANK() OVER (PARTITION BY dept_num ORDER BY salary) AS rank, 
DENSE_RANK() OVER (PARTITION BY dept_num ORDER BY salary) 
AS dense_rank,
ROW_NUMBER() OVER () AS row_num,
ROUND((CUME_DIST() OVER (PARTITION BY dept_num 
ORDER BY salary)), 1) AS cume_dist,
PERCENT_RANK() OVER(PARTITION BY dept_num 
ORDER BY salary) AS percent_rank,
NTILE(4) OVER(PARTITION BY dept_num ORDER BY salary) AS ntile
FROM employee_contract
ORDER BY dept_num;

SELECT name, dept_num, salary,
LEAD(salary, 2) OVER(PARTITION BY dept_num 
ORDER BY salary) AS lead,
LAG(salary, 2, 0) OVER(PARTITION BY dept_num 
ORDER BY salary) AS lag,
FIRST_VALUE(salary) OVER (PARTITION BY dept_num 
ORDER BY salary) AS first_value,
LAST_VALUE(salary) OVER (PARTITION BY dept_num 
ORDER BY salary) AS last_value_default,
LAST_VALUE(salary) OVER (PARTITION BY dept_num 
ORDER BY salary 
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
AS last_value
FROM employee_contract ORDER BY dept_num;

SELECT name, dept_num, salary AS sal,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY
name ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) win1,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN 2 PRECEDING AND UNBOUNDED FOLLOWING) win2,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING) win3,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN 1 FOLLOWING AND 2 FOLLOWING) win5,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN CURRENT ROW AND CURRENT ROW) win7,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) win8,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) win9,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) win10,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING) win11,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED
FOLLOWING) win12,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
name ROWS 2 PRECEDING) win13
FROM employee_contract
ORDER BY dept_num, name;

SELECT name, dept_num, salary,
MAX(salary) OVER w1 AS win1,
MAX(salary) OVER w2 AS win2,
MAX(salary) OVER w3 AS win3
FROM employee_contract
ORDER BY dept_num, name
WINDOW
w1 AS (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 
2 PRECEDING AND CURRENT ROW),
w2 AS w3,
w3 AS (PARTITION BY dept_num ORDER BY name ROWS BETWEEN 
1 PRECEDING AND 2 FOLLOWING);

SELECT name, salary, start_year,
MAX(salary) OVER (PARTITION BY dept_num ORDER BY 
start_year RANGE BETWEEN 2 PRECEDING AND CURRENT ROW) win1
FROM
(
  SELECT name, salary, dept_num, 
  YEAR(start_date) AS start_year
  FROM employee_contract
) a;


4.4 Bucket table sampling example

CREATE TABLE employee_id_buckets                         
(
  name string,
  employee_id int,
  work_place ARRAY<string>,
  sex_age STRUCT<sex:string,age:int>,
  skills_score MAP<string,int>,
  depart_title MAP<STRING,ARRAY<STRING>>
)
CLUSTERED BY (employee_id) INTO 2 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';

set mapred.reduce.tasks = 2;

set hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE employee_id_buckets SELECT * FROM employee_id;

You may run into an OutOfMemory error here when using the Tez engine:

https://azure.microsoft.com/en-us/blog/hive-memory-settings-resolve-out-of-memory-errors-using-azure-hdinsight/

The following two memory settings define the container memory for the heap: hive.tez.container.size and hive.tez.java.opts.
In practice, an OOM exception does not necessarily mean the container size is too small; it usually means the Java heap size (hive.tez.java.opts)
is too small. So whenever you see OOM, first try increasing "hive.tez.java.opts". If needed, you may also have to increase "hive.tez.container.size".
The "java.opts" value should be around 80% of "container.size".

-- SET hive.tez.container.size=10240;
SET hive.tez.java.opts=-Xmx1024m;

SELECT name FROM employee_id_buckets TABLESAMPLE(BUCKET 1 OUT OF 2 ON rand()) a;

4.5 Block sampling - Sample by rows

SELECT name FROM employee_id_buckets TABLESAMPLE(4 ROWS) a;

4.6 Sample by percentage of data size

SELECT name FROM employee_id_buckets TABLESAMPLE(10 PERCENT) a;

4.7 Sample by data size

SELECT name FROM employee_id_buckets TABLESAMPLE(3M) a;   

-- does NOT work without the hash_md5 UDF library (from BRICK); shown for reference only:
-- select * from employee_id where abs( hash_md5(employee_id) ) % 100 < 10;
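If the hash_md5 library is unavailable, a similar (though not MD5-based) sketch can use Hive's built-in hash() function; this is illustrative, not a statistically rigorous sample:

```sql
-- Roughly 10% of rows, keyed on the built-in hash of employee_id
SELECT * FROM employee_id
WHERE abs(hash(employee_id)) % 100 < 10;
```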

5. Hive and Machine Learning

Add the following two lines to your $HOME/.hiverc file.

add jar /root/TrainingOnHDP/lib/hivemall-core-0.4.2-rc.2-with-dependencies.jar;
source /root/TrainingOnHDP/lib/define-all.hive;

This automatically loads all Hivemall functions every time you start a Hive session. Alternatively, you can run the same add jar and source commands manually at the start of each session. First, prepare the KDD Cup 2010 dataset on HDFS:

awk -f /root/TrainingOnHDP/dataset/ml/conv.awk /root/TrainingOnHDP/dataset/ml/kdd10a/kdda | hadoop fs -put - /dataset/kdd10a/train/kdda
awk -f /root/TrainingOnHDP/dataset/ml/conv.awk /root/TrainingOnHDP/dataset/ml/kdd10a/kdda.t | hadoop fs -put - /dataset/kdd10a/test/kdda.t

create database kdd2010;
use kdd2010;

create external table kdd10a_train (
  rowid int,
  label int,
  features ARRAY<STRING>
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY "," 
STORED AS TEXTFILE LOCATION '/dataset/kdd10a/train';

create external table kdd10a_test (
  rowid int, 
  label int,
  features ARRAY<STRING>
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY "," 
STORED AS TEXTFILE LOCATION '/dataset/kdd10a/test';




$ hive
add jar /root/TrainingOnHDP/lib/hivemall-core-0.4.2-rc.2-with-dependencies.jar;
source /root/TrainingOnHDP/lib/define-all.hive;

cat /root/.hiverc

	5.1 dual
	
		create table dual (dummy int);

		INSERT INTO TABLE dual SELECT count(*)+1 FROM dual;

	5.2 Feature Vectorizer
	
		array<string> vectorize_features(array<string> featureNames, ...) is useful for generating a feature vector for each row of a table.
		
		select vectorize_features(array("a","b"),"0.2","0.3") from employee limit 1;
		>["a:0.2","b:0.3"]

		-- avoid zero weight
		select vectorize_features(array("a","b"),"0.2",0) from employee limit 1;
		> ["a:0.2"]

		-- true boolean value is treated as 1.0 (a categorical value w/ its column name)
		select vectorize_features(array("a","b","bool"),0.2,0.3,true) from employee limit 1;
		> ["a:0.2","b:0.3","bool:1.0"]
	

6. Hive Integration with other tools
	6.1 HCatalog
	
	create table drivers
		(driverId int,
		name string,
		ssn bigint,
		location string,
		certified string,
		wageplan string)
	ROW FORMAT DELIMITED
	FIELDS TERMINATED BY ','
	STORED AS TEXTFILE
	TBLPROPERTIES("skip.header.line.count"="1");
	
	LOAD DATA LOCAL INPATH '/root/TrainingOnHDP/dataset/drivers.csv' OVERWRITE INTO TABLE drivers;
	
	
	create table truck_events
		(driverId int,
		truckId int,
		eventTime string,
		eventType string,
		longitude double,
		latitude double,
		eventKey string,
		correlationId bigint,
		driverName string,
		routeId int,
		routeName string)
	ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
	STORED AS TEXTFILE
	TBLPROPERTIES("skip.header.line.count"="1");
	
	LOAD DATA LOCAL INPATH '/root/TrainingOnHDP/dataset/truck_event_text_partition.csv' OVERWRITE INTO TABLE truck_events;
	
	select a.driverId,a.driverName,a.eventType,b.certified from truck_events a join drivers b ON (a.driverId = b.driverId);
	
	PIG:
	
	a = LOAD 'drivers' using org.apache.hive.hcatalog.pig.HCatLoader();
	b = LOAD 'truck_events' using org.apache.hive.hcatalog.pig.HCatLoader();
	c = join b by driverid, a by driverid;
	dump c;
	
	WebHCat
	
	http://127.0.0.1:50111/templeton/v1/ddl/database/default/table/employee?user.name=hive
	
	
	
	6.2 Oozie
	6.3 HBase
	
		6.3.1 
		CREATE TABLE IF NOT EXISTS pagecounts (projectcode STRING, pagename STRING, pageviews STRING, bytes STRING)
		ROW FORMAT
		DELIMITED FIELDS TERMINATED BY ' '
		LINES TERMINATED BY '\n'
		STORED AS TEXTFILE
		LOCATION '/user/root/pagecounts';
		
		select INPUT__FILE__NAME from pagecounts limit 10;
		
		CREATE VIEW IF NOT EXISTS pgc (rowkey, pageviews, bytes) AS
		SELECT concat_ws('/',
				projectcode,
				concat_ws('/',
				pagename,
				regexp_extract(INPUT__FILE__NAME, 'pagecounts-(\\d{8}-\\d{6})', 1))),
				pageviews, bytes
		FROM pagecounts;
		
		CREATE TABLE IF NOT EXISTS pagecounts_hbase (rowkey STRING, pageviews STRING, bytes STRING)
		STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
		WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,0:PAGEVIEWS,0:BYTES')
		TBLPROPERTIES ('hbase.table.name' = 'PAGECOUNTS');
		
		FROM pgc INSERT INTO TABLE pagecounts_hbase SELECT pgc.* WHERE rowkey LIKE 'en/q%' LIMIT 10;
		
		hbase shell
		
		scan 'PAGECOUNTS'
		
		CREATE VIEW "PAGECOUNTS" (pk VARCHAR PRIMARY KEY,
		"0".PAGEVIEWS VARCHAR,
		"0".BYTES VARCHAR)
		
	6.4 Zeppelin
	
	http://127.0.0.1:9995/
	
	%hive(default)

	select * from employee
	
	show tables
	
	6.5 Tableau
	6.6 Talend Open Studio
	6.7 Datameer
	6.8 Excel
	6.9 Qlikview
	6.10 Sqoop
	
		sqoop import --hive-import --hive-overwrite --connect jdbc:mysql://localhost/sqoop_test --table stocks --fetch-size 10 --username hip_sqoop_user -P
		sqoop import --hive-import --hive-overwrite --hive-partition-key edate --hive-partition-value "20160101" --target-dir /user/root/stocks/edate=20160101 --connect jdbc:mysql://localhost/sqoop_test --table stocks --fetch-size 10 --username hip_sqoop_user -P
		sqoop import --hive-import --hive-partition-key edate --hive-partition-value "20160102" --target-dir /user/root/stocks/edate=20160102 --connect jdbc:mysql://localhost/sqoop_test --table stocks --fetch-size 10 --username hip_sqoop_user -P
		sqoop import --hive-import --hive-partition-key edate --hive-partition-value "20160103" --target-dir /user/root/stocks/edate=20160103 --connect jdbc:mysql://localhost/sqoop_test --table stocks --fetch-size 10 --username hip_sqoop_user -P

	6.11 MySQL (MySQLHandler and Select AS)
	
	add jar /root/TrainingOnHDP/lib/hive-jdbc-handler-0.8.1-wso2v7.jar;
	
	CREATE EXTERNAL TABLE business
	ROW FORMAT SERDE 'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCDataSerDe'
	with serdeproperties (
		"escaped" = "true"
	)
	STORED BY 'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler'
	TBLPROPERTIES (
		"mapred.jdbc.driver.class"="com.mysql.jdbc.Driver",
		"mapred.jdbc.url"="jdbc:mysql://localhost:3306/test",
		"mapred.jdbc.username"="hive",
		"mapred.jdbc.password"="root",
		"mapred.jdbc.input.table.name"="business",
		"mapred.jdbc.output.table.name"="business"	
    );
	

	CREATE EXTERNAL TABLE PhonebrandTable(brand STRING,totalOrders INT, totalQuantity INT)
	STORED BY 'org.wso2.carbon.hadoop.hive.jdbc.storage.JDBCStorageHandler'
	TBLPROPERTIES (
		"mapred.jdbc.driver.class"="com.mysql.jdbc.Driver",
		"mapred.jdbc.url"="jdbc:mysql://localhost:3306/test",
		"mapred.jdbc.username"="root",
		"mapred.jdbc.password"="",
		"hive.jdbc.update.on.duplicate" = "true",
		"hive.jdbc.primary.key.fields" = "brand",
		"hive.jdbc.table.create.query" = "CREATE TABLE brandSummary (brand VARCHAR(100) NOT NULL PRIMARY KEY, totalOrders INT, totalQuantity INT)"
	);	
	
	6.12 Spark
		val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
		sqlContext.sql("FROM employee SELECT *").collect().foreach(println)
		sqlContext.sql("FROM employee SELECT count(*)").collect().foreach(println)
		sqlContext.sql("FROM employee_external_test1 SELECT *").collect().foreach(println)
		sqlContext.sql("MSCK REPAIR TABLE employee_external_test1")
	
	6.13 Kylin
	
	6.14 Zookeeper
	
	
	
7. Hive Security




8. Hive and Data Governance

9. Hive Performance

10. Hive Data Transfer between Clusters

When working with Hive, we sometimes need to migrate data among different environments, or to back up some data. Since Hive 0.8.0, the EXPORT and IMPORT statements are available to support importing and exporting data in HDFS for data migration or backup/restore purposes.

The EXPORT statement exports both data and metadata from a table or partition. Metadata is exported in a file called _metadata; data is exported in a subdirectory called data:

EXPORT TABLE stocks TO '/user/root/stocks1';

After EXPORT, we can manually copy the exported files to other Hive instances, or use the Hadoop distcp command to copy them to other HDFS clusters. Then we can import the data in the following ways.

Import data to a table with the same name (it throws an error if the table already exists):

IMPORT FROM '/user/root/stocks1';

Import data to a new table:

IMPORT TABLE stocks_imported FROM '/user/root/stocks1';

Import data to an external table, where the LOCATION property is optional:

IMPORT EXTERNAL TABLE stocks_imported_external FROM '/user/root/stocks1'
LOCATION '/user/root/stocks3';

Export and import partitions:

EXPORT TABLE stocks partition(edate=20160101) TO '/user/root/stocks7';


11. Extending Hive

	Although Hive has many built-in functions, users sometimes will need power beyond that
	provided by built-in functions. For these instances, Hive offers the following three main
	areas where its functionalities can be extended:

	11.1 UDF
	
	These are regular user-defined functions that operate row-wise and output one
	result for one row, such as most built-in mathematic and string functions.
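Once a custom UDF has been compiled into a jar, registering and calling it looks like the following sketch; the jar path and class name here are hypothetical placeholders:

```sql
-- Register and call a custom UDF (jar path and class are examples only)
ADD JAR /root/TrainingOnHDP/lib/my-udfs.jar;
CREATE TEMPORARY FUNCTION to_upper_custom AS 'com.example.hive.udf.ToUpperUDF';
SELECT to_upper_custom(name) FROM employee;
```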
	
	
	11.2 UDAF
	
	These are user-defined aggregating functions that operate row-wise or
	group-wise and output one row overall or one row per group as a result, such as the MAX
	and COUNT built-in functions.
	
	11.3 UDTF
	
	These are user-defined table-generating functions that also operate row-wise,
	but they produce multiple rows as a result, such as the EXPLODE function.
	A UDTF can be used either after SELECT or in a LATERAL VIEW clause.
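A typical UDTF usage, with the built-in EXPLODE function in a LATERAL VIEW against the work_place array column of the employee_id_buckets table defined earlier:

```sql
-- Each element of the work_place array becomes its own row
SELECT name, place
FROM employee_id_buckets
LATERAL VIEW explode(work_place) wp AS place;
```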
	
	11.4 SerDe
	
	SerDe stands for Serializer and Deserializer. It is the technology that Hive uses to process
	records and map them to column data types in Hive tables. To explain the scenario of
	using SerDe, we need to understand how Hive reads and writes data.
	The process to read data is as follows:
	
		1. Data is read from HDFS.
		2. Data is processed by the INPUTFORMAT implementation, which defines the input data splits and key/value records. In Hive, we can use CREATE TABLE… STORED AS
		<FILE_FORMAT> (see the File format section for the available file formats) to specify which INPUTFORMAT it reads from.
		3. The Java Deserializer class defined in SerDe is called to format the data into a record that maps to column and data types in a table.
		
	LazySimpleSerDe: The default built-in SerDe (org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe) that’s used with the TEXTFILE format. It can be implemented as follows:
		CREATE TABLE test_serde_lz STORED AS TEXTFILE AS SELECT name from employee;	
		
	ColumnarSerDe: This is the built-in SerDe used with the RCFILE format. It can be used as follows:
		CREATE TABLE test_serde_cs ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe' STORED AS RCFile AS SELECT name from employee;	
	
	RegexSerDe: This is the built-in Java regular expression SerDe to parse text files. It can be used as follows:	
		CREATE TABLE test_serde_rex(name string,sex string,age string) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
		WITH SERDEPROPERTIES('input.regex' = '([^,]*),([^,]*),([^,]*)','output.format.string' = '%1$s %2$s %3$s') STORED AS TEXTFILE;
		
	HBaseSerDe: This is the built-in SerDe to enable Hive to integrate with HBase. We can store Hive tables in HBase by leveraging this SerDe. Make sure to have HBase
	installed before running the following query:
		CREATE TABLE test_serde_hb(id string, name string, sex string, age string)
		ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe' STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
		WITH SERDEPROPERTIES ("hbase.columns.mapping"=":key,info:name,info:sex,info:age")
		TBLPROPERTIES("hbase.table.name" = "test_serde");	
		
	AvroSerDe: This is the built-in SerDe that enables reading and writing Avro data in Hive tables. Avro is a remote procedure call and data
		serialization framework. Since Hive 0.14.0, Avro-backed tables can simply be created by using the CREATE TABLE… STORED AS AVRO statement, as follows:

		CREATE TABLE test_serde_avro(name string, sex string, age string)
		ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT
		'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'	
	
	ParquetHiveSerDe: This is the built-in SerDe (parquet.hive.serde.ParquetHiveSerDe) that enables reading and writing the Parquet data format since Hive 0.13.0. It can be used as follows:
		CREATE TABLE test_serde_parquet STORED AS PARQUET AS SELECT name from employee;
		
	OpenCSVSerDe: This is the SerDe to read and write CSV data. It comes as a built-in SerDe since Hive 0.14.0. We can also install the implementation from other open
		source libraries, such as https://github.com/ogrodnek/csv-serde. It can be used as follows:
		CREATE TABLE test_serde_csv(name string, sex string, age string)
		ROW FORMAT SERDE
		'org.apache.hadoop.hive.serde2.OpenCSVSerde'
		STORED AS TEXTFILE;	
		
	JSONSerDe: This is a third-party SerDe to read and write JSON data records with Hive. Make sure to install it before running the following query:
		CREATE TABLE test_serde_js(name string, sex string,age string)
		ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
		STORED AS TEXTFILE;	
	
	11.5 Custom InputFormat and OutputFormat
	11.6 Storage Handler
	11.7 Streaming
	
	Hive can also leverage the streaming feature in Hadoop to transform data in an alternative
	way. The streaming API opens an I/O pipe to an external process (script). The
	process reads data from standard input and writes the results out through standard
	output. In Hive, we can use the TRANSFORM clause in HiveQL directly to embed mapper
	and reducer scripts written in commands, shell scripts, Java, or other programming
	languages.
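A minimal TRANSFORM sketch; the script path is a hypothetical placeholder for any script that reads tab-separated rows from standard input and writes rows to standard output:

```sql
-- Stream each row through an external script
ADD FILE /root/TrainingOnHDP/scripts/upper.py;
SELECT TRANSFORM (name, age)
  USING 'python upper.py'
  AS (name_upper string, age int)
FROM employee_test;
```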
	
	
	

12. Hive Performance

	Since Hive 0.13.0, Hive includes the following new features for performance optimizations:
	
	Tez: 
	
	Tez (http://tez.apache.org/) is an application framework built on YARN that can
	execute complex directed acyclic graphs (DAGs) for general data-processing tasks.
	Tez further splits map and reduce jobs into smaller tasks and combines them in a
	flexible and efficient way for execution. Tez is considered a flexible and powerful
	successor to the MapReduce framework. To configure Hive to use Tez, we need to
	override the default MapReduce engine setting:
	
		SET hive.execution.engine=tez;
		
	Vectorization: 
	Vectorization optimization processes a larger batch of data at the same
	time rather than one row at a time, thus significantly reducing computing overhead.
	Each batch consists of a column vector that is usually an array of primitive types.
	Operations are performed on the entire column vector, which improves the
	instruction pipelines and cache usage. Files must be stored in the Optimized Row Columnar (ORC) format in order to use vectorization. To
	enable vectorization, apply the following setting:
		
		SET hive.vectorized.execution.enabled=true;
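Since vectorization applies only to ORC-backed tables, one way to try it is to copy an existing table into ORC first; test_employee_orc is a throwaway name used for illustration:

```sql
-- Vectorized query over an ORC copy of employee_contract
CREATE TABLE test_employee_orc STORED AS ORC
  AS SELECT * FROM employee_contract;
SET hive.vectorized.execution.enabled=true;
SELECT dept_num, AVG(salary) FROM test_employee_orc GROUP BY dept_num;
```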

13. Hive Trouble Shooting

14. Hive Bucket

- Load performance

One thing buckets are used for is to increase load performance.

- SELECT performance (predicate pushdown)

Buckets can help with predicate pushdown, since every row with the same key value ends up in the same bucket. So if you bucket by 31 days and
filter for one day, Hive is able to more or less disregard the other 30 buckets. This is not automatically good, since you often WANT parallel
execution, such as for aggregations, so whether it helps depends on your query. It might be better to sort by day and bucket by something like customer
id if you have to have buckets for one of the other reasons.

- Join Performance ( bucket join )

Buckets can lead to efficient joins if both joined tables are bucketed on the join key, since Hive then only needs to join bucket with bucket. This was big
in the old days but is not that applicable anymore with cost-based optimization in newer Hive versions, since the optimizer is already very good at
choosing map-side vs. shuffle joins, and a bucket join can actually prevent it from using the better one.
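In those older versions, a bucket map join had to be enabled explicitly, and both tables had to be bucketed on the join key with compatible bucket counts; a sketch, shown here as a self-join purely for illustration:

```sql
-- Old-style bucket map join tuning (pre cost-based optimizer)
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
SELECT /*+ MAPJOIN(b) */ a.name, b.employee_id
FROM employee_id_buckets a
JOIN employee_id_buckets b ON a.employee_id = b.employee_id;
```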

- Sampling performance

Some sample operations can get faster with buckets.

So, to summarize: buckets are a bit of an older concept, and I wouldn't use them unless I had a clear case for them. The join argument is not that
applicable anymore, and the increased load performance is also not always relevant, since you normally load single partitions, where a map-only load is often
best. SELECT pushdown can be enhanced but also hindered, depending on how you do it, and a SORT BY is normally better during load (see document).
And I think sampling is a bit niche.


15. Hive Transaction

Since Hive version 0.13.0, Hive fully supports row-level transactions by offering full Atomicity, Consistency, Isolation, and Durability (ACID). For now, all transactions are autocommitted, and they only support data in the Optimized Row Columnar (ORC) file format (available since Hive 0.11.0) and in bucketed tables.

The following configuration parameters must be set appropriately to turn on transaction support in Hive:

SET hive.support.concurrency = true;
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on = true;
SET hive.compactor.worker.threads = 1;

The SHOW TRANSACTIONS command, added in Hive 0.13.0, shows currently open and aborted transactions in the system:

jdbc:hive2://> SHOW TRANSACTIONS;
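With the settings above in place, a transactional table must be bucketed and stored as ORC; a minimal sketch (later Hive versions also require the transactional table property):

```sql
-- ACID table: ORC format, bucketed, marked transactional
CREATE TABLE employee_acid (name string, age int)
CLUSTERED BY (name) INTO 2 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

INSERT INTO TABLE employee_acid VALUES ('peter', 30);
UPDATE employee_acid SET age = 31 WHERE name = 'peter';
DELETE FROM employee_acid WHERE name = 'peter';
```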



16. Index

	Hive has supported index creation on tables/partitions since Hive 0.7.0. An index in
	Hive provides a key-based data view and better data access for certain operations, such as
	WHERE, GROUP BY, and JOIN. Using an index can be a cheaper alternative to a full table scan.
	
	The command to create an index in Hive is straightforward as follows:
	
	CREATE INDEX idx_id_employee_id ON TABLE employee_id (employee_id) AS 'COMPACT' WITH DEFERRED REBUILD;
	
	In addition to the COMPACT index used in the preceding example, Hive has also supported BITMAP indexes since Hive 0.8.0 for columns with
	fewer distinct values, as shown in the following example:
	
	CREATE INDEX idx_sex_employee_id ON TABLE employee_id (sex_age) AS 'BITMAP' WITH DEFERRED REBUILD;

	The WITH DEFERRED REBUILD clause in the preceding examples prevents the index from being built immediately.
	
	When data in the base table changes, the ALTER…REBUILD command must be used to bring the index up to date:
	
	ALTER INDEX idx_id_employee_id ON employee_id REBUILD;
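After the rebuild, the index can be listed or removed as follows:

```sql
-- Inspect and drop indexes on a table
SHOW INDEX ON employee_id;
DROP INDEX idx_id_employee_id ON employee_id;
```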
	
	
17. File format

	Hive supports the TEXTFILE, SEQUENCEFILE, RCFILE, ORC, and PARQUET file formats. The three ways to specify a file format are as follows:
	
	CREATE TABLE… STORED AS <File_Format>
	ALTER TABLE… [PARTITION partition_spec] SET FILEFORMAT <File_Format>
	SET hive.default.fileformat=<File_Format> -- default file format for new tables
	Here, <File_Format> is one of TEXTFILE, SEQUENCEFILE, RCFILE, ORC, or PARQUET.

	We can load a text file directly into a table with the TEXTFILE format. To load data into a table with another file format, we need to load the data into a TEXTFILE-format table first.
	Then, use INSERT OVERWRITE TABLE <target_file_format_table> SELECT * FROM <text_format_source_table> to convert and insert the data in the expected file format.
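For example, the text-based employee_contract table created earlier can be converted to ORC with a single CTAS statement:

```sql
-- Text -> ORC conversion; the ORC table name is a throwaway example
CREATE TABLE employee_contract_orc STORED AS ORC
  AS SELECT * FROM employee_contract;
```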
	
	The file formats supported by Hive and their optimizations are as follows:
	
	TEXTFILE: 
	
	This is the default file format for Hive. Data is not compressed in a text file. It can be compressed with compression tools, such as GZip, Bzip2, and Snappy.
	However, most of these compressed files (Bzip2 being the splittable exception) cannot be split as input during processing. As a
	result, this leads to a single, huge map job processing one big file.

	SEQUENCEFILE: 
	
	This is a binary storage format for key/value pairs. The benefit of a sequence file is that it is more compact than a text file and fits well with the
	MapReduce output format. Sequence files can be compressed at the record or block
	level, where block level has a better compression ratio. To enable block-level compression, use the following settings:
		SET hive.exec.compress.output=true;
		SET io.seqfile.compression.type=BLOCK;
		
	Unfortunately, both text and sequence files, as row-level storage file formats, are not an optimal solution, since Hive has to read a full row even if only one column is
	requested. Hybrid row-columnar storage file formats, such as the RCFILE, ORC, and PARQUET implementations, were created to resolve this problem.
	
	RCFILE: 
	
	This is short for Record Columnar File. It is a flat file consisting of binary key/value pairs that shares much similarity with a sequence file. The RCFile splits
	data horizontally into row groups. One or several groups are stored in an HDFS file. Then, RCFile saves the row group data in a columnar format by saving the first
	column across all rows, then the second column across all rows, and so on. This format is splittable and allows Hive to skip irrelevant parts of data and get the results
	faster and cheaper.
	
	ORC: 
	
	This is short for Optimized Row Columnar. It has been available since Hive 0.11.0. The ORC format can be considered an improved version of RCFILE. It provides a larger
	default block size of 256 MB (RCFILE has 4 MB and SEQUENCEFILE has 1 MB), optimized for large sequential reads on HDFS for more throughput and fewer files, which
	reduces the load on the namenode. Unlike RCFILE, which relies on the metastore to know data types, the ORC file understands data types by using specific encoders, so
	that it can optimize compression depending on the type. It also stores basic statistics, such as MIN, MAX, SUM, and COUNT, on columns, as well as a lightweight index
	that can be used to skip blocks of rows that do not matter.
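ORC tables also accept table-level tuning properties, for example the compression codec and stripe size; the values below are illustrative:

```sql
-- ORC with an explicit codec and a 64 MB stripe size
CREATE TABLE test_orc_tuned (name string, salary int)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY', 'orc.stripe.size'='67108864');
```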
	
	PARQUET: 
	
	This is another row-columnar file format, with a design similar to that of ORC. Moreover, Parquet is supported by a wider range of projects in
	the Hadoop ecosystem, compared to ORC, which is mainly supported by Hive and Pig. Parquet leverages the design best practices of Google's Dremel to support nested
	data structures. Parquet has been supported by a plugin since Hive 0.10.0 and has had native support since Hive 0.13.0.

18. Compression
	Compression techniques in Hive can significantly reduce the amount of data transferring
	between mappers and reducers by proper intermediate output compression as well as
	output data size in HDFS by output compression. As a result, the overall Hive query will
	have better performance. To compress the intermediate files produced by Hive between
	multiple MapReduce jobs, as well as the final output, we can set the following properties
	(false by default) in the Hive CLI or the hive-site.xml file:
	
		SET hive.exec.compress.intermediate=true;
		SET hive.exec.compress.output=true;
		SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
		
	Compression   Codec                                        Extension   Splittable
	Deflate       org.apache.hadoop.io.compress.DefaultCodec   .deflate    N
	GZip          org.apache.hadoop.io.compress.GzipCodec      .gz         N
	Bzip2         org.apache.hadoop.io.compress.BZip2Codec     .bz2        Y
	LZO           com.hadoop.compression.lzo.LzopCodec         .lzo        N
	LZ4           org.apache.hadoop.io.compress.Lz4Codec       .lz4        N
	Snappy        org.apache.hadoop.io.compress.SnappyCodec    .snappy     N
	
19. Merging Small Files

	For Hive, we can apply the following configurations to merge the files of query results and avoid creating many small files:
	
	hive.merge.mapfiles: This merges small files at the end of a map-only job. By default, it is true.
	hive.merge.mapredfiles: This merges small files at the end of a MapReduce job. Set it to true since its default is false.
	hive.merge.size.per.task: This defines the size of merged files at the end of the job. The default value is 256,000,000.
	hive.merge.smallfiles.avgsize: This is the threshold for triggering file merge. The default value is 16,000,000.
	
	When the average output file size of a job is less than the value specified by hive.merge.smallfiles.avgsize, and both hive.merge.mapfiles (for map-only jobs)
	and hive.merge.mapredfiles (for MapReduce jobs) are set to true, Hive will start an additional MapReduce job to merge the output files into big files.
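Putting the four properties together for a session whose output should be merged:

```sql
-- Merge outputs into ~256 MB files when the average output file is under 16 MB
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=16000000;
```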


20. Local mode

	Hadoop can run in standalone, pseudo-distributed, and fully distributed mode. Most of the time, we need to configure Hadoop to run in fully distributed mode. When the data to
	process is small, it is wasteful to start distributed data processing, since launching in fully distributed mode takes more time than the job processing itself. Since
	Hive 0.7.0, Hive supports automatic conversion of a job to run in local mode with the following settings:
		
		SET hive.exec.mode.local.auto=true; --default false
		SET hive.exec.mode.local.auto.inputbytes.max=50000000;
		SET hive.exec.mode.local.auto.input.files.max=5; --default 4
		
	A job must satisfy all of the following conditions to run in local mode:
	- The total input size of the job is lower than hive.exec.mode.local.auto.inputbytes.max
	- The total number of map tasks is less than hive.exec.mode.local.auto.input.files.max
	- The total number of reduce tasks required is 1 or 0
	
17. JVM reuse

	By default, Hadoop launches a new JVM for each map or reduce task and runs the tasks in parallel. When each task is a lightweight job running only for a
	few seconds, the JVM startup process can be a significant overhead. The MapReduce framework (version 1 only, not YARN) has an option to reuse a JVM by
	running mappers/reducers serially in a shared JVM instead of in parallel. JVM reuse applies to map or reduce tasks in the same job; tasks from different jobs always run in separate JVMs. To enable
	the reuse, we can set the maximum number of tasks per job that share a JVM using the mapred.job.reuse.jvm.num.tasks property. Its default value is 1:
		
		SET mapred.job.reuse.jvm.num.tasks=5;
	We can also set the value to -1 to indicate that all the tasks for a job will run in the same JVM.
	
18. Parallel execution

	Hive queries are commonly translated into a number of stages that are executed in sequence by default. These stages are not always dependent on each other; instead, they can
	run in parallel to reduce the overall job running time. We can enable this feature with the following settings:
	
		SET hive.exec.parallel=true; --default false
		SET hive.exec.parallel.thread.number=16; --default 8; the maximum number of stages to run in parallel
	
	Parallel execution will increase cluster utilization. If the utilization of a cluster is
	already very high, parallel execution will not help much in terms of overall performance.	
	
19. Join optimization
	
	Common join
	
	The common join is also called the reduce-side join. It is the basic join in Hive and works
	most of the time. For common joins, we need to make sure the big table is on the rightmost
	side of the join, or specify it with a hint, as follows:
	/*+ STREAMTABLE(stream_table_name) */
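	
	For instance, assuming hypothetical customers and orders tables where orders is by far the larger one, the hint marks it as the streamed (big) table:

	```sql
	SELECT /*+ STREAMTABLE(o) */ c.name, o.amount
	FROM customers c
	JOIN orders o ON c.id = o.customer_id;
	```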
	
	Map join
	
	Map join is used when one of the join tables is small enough to fit in memory, so it is very fast but limited. Since Hive 0.7.0, Hive can convert map joins automatically with the
	following settings:
	
		SET hive.auto.convert.join=true; --default false
		SET hive.mapjoin.smalltable.filesize=600000000;
		--default 25M
		
		SET hive.auto.convert.join.noconditionaltask=true;
		--default false. Set to true so that map join hint is not needed
		
		SET hive.auto.convert.join.noconditionaltask.size=10000000;
		--default 10000000; controls the combined size of tables that can fit in memory
		
	Once autoconvert is enabled, Hive automatically checks the file size of the smaller table.
	If it is smaller than the value specified by hive.mapjoin.smalltable.filesize, Hive
	tries to convert the common join into a map join; otherwise, the join is executed as a
	common join.
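	
	Before autoconvert, a map join had to be requested explicitly with a hint. Using the same hypothetical tables, with customers as the small table to cache in memory:

	```sql
	SELECT /*+ MAPJOIN(c) */ c.name, o.amount
	FROM customers c
	JOIN orders o ON c.id = o.customer_id;
	```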
	
	Bucket map join
	
	Bucket map join is a special type of map join applied on the bucket tables. To enable bucket map join, we need to enable the following settings:
		SET hive.auto.convert.join=true; --default false
		SET hive.optimize.bucketmapjoin=true; --default false
		
	In a bucket map join, all the join tables must be bucket tables and must join on the bucket
	columns. In addition, the number of buckets in the bigger table must be a multiple of the
	number of buckets in the smaller table.
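	
	A sketch with hypothetical tables; note that the bucket counts (2 and 4) satisfy the multiple requirement, and the join is on the bucket columns:

	```sql
	CREATE TABLE users (id INT, name STRING)
	CLUSTERED BY (id) INTO 2 BUCKETS;

	CREATE TABLE posts (user_id INT, title STRING)
	CLUSTERED BY (user_id) INTO 4 BUCKETS;

	SET hive.auto.convert.join=true;
	SET hive.optimize.bucketmapjoin=true;

	SELECT u.name, p.title
	FROM users u JOIN posts p ON u.id = p.user_id;
	```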
	
	Sort merge bucket (SMB) join
	
	SMB is a join performed on bucket tables that are sorted, bucketed, and joined on the same
	columns. It reads data from both bucket tables and performs common joins (both map and
	reduce stages are triggered) on the bucket tables. We need to enable the following
	properties to use SMB:
	
	SET hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
	SET hive.auto.convert.sortmerge.join=true;
	SET hive.optimize.bucketmapjoin=true;
	SET hive.optimize.bucketmapjoin.sortedmerge=true;
	SET hive.auto.convert.sortmerge.join.noconditionaltask=true;
	
	Sort merge bucket map (SMBM) join
	
	SMBM join is a special bucket join but triggers map-side join only. It can avoid caching
	all rows in the memory like map join does. To perform SMBM joins, the join tables must
	have the same bucket, sort, and join condition columns. To enable such joins, we need to
	enable the following settings:
	
		SET hive.auto.convert.join=true;
		SET hive.auto.convert.sortmerge.join=true;
		SET hive.optimize.bucketmapjoin=true;
		SET hive.optimize.bucketmapjoin.sortedmerge=true;
		SET hive.auto.convert.sortmerge.join.noconditionaltask=true;
		SET hive.auto.convert.sortmerge.join.bigtable.selection.policy=org.apache.hadoop.hive.ql.optimizer.TableSizeBasedBigTableSelectorForAutoSMJ;
		
	Skew join

	When working with data that has a highly uneven distribution, data skew can happen in such a way that a small number of compute nodes must handle the bulk of the
	computation. The following settings inform Hive to optimize properly when data skew happens:

		SET hive.optimize.skewjoin=true;
		--If there is data skew in join, set it to true. Default is false.
		
		SET hive.skewjoin.key=100000;
		--default 100000. If the number of rows with the same join key is
		--bigger than this, the skewed keys are sent to other unused reducers.	

	Data skew can happen in GROUP BY data too. To enable skew-data optimization for the
	GROUP BY result, we need the following setting:
		SET hive.groupby.skewindata=true;	
		
		
	
	
20. Performance utilities: Explain and Analyze

	Hive provides an EXPLAIN command to return a query's execution plan without running the
	query. We can use the EXPLAIN command on queries when we have a doubt or a concern about
	performance. The EXPLAIN command also helps to see the difference between two or more
	queries written for the same purpose. The syntax for EXPLAIN is as follows:

	EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] hive_query
	
	EXPLAIN SELECT sex_age.sex, count(*) FROM employee_partitioned WHERE year=2014 GROUP BY sex_age.sex LIMIT 2;


	Hive statistics are a collection of data that describe details of the objects in the Hive
	database, such as the number of rows, number of files, and raw data size. Statistics are
	metadata about Hive data. Hive supports statistics at the table, partition, and column
	level. These statistics serve as an input to the Hive Cost-Based Optimizer (CBO), which
	picks the query plan with the lowest cost in terms of the system resources required to
	complete the query.
	
	Statistics are gathered through the ANALYZE statement (since Hive 0.10.0) on tables,
	partitions, and columns, as given in the following examples:
	
	ANALYZE TABLE employee COMPUTE STATISTICS;

	ANALYZE TABLE employee_partitioned PARTITION(year=2014, month=12) COMPUTE STATISTICS;

	ANALYZE TABLE employee_id COMPUTE STATISTICS FOR COLUMNS employee_id;
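
	Once computed, the gathered statistics can be inspected with DESCRIBE FORMATTED; table-level numbers appear under Table Parameters, and naming a column shows its column statistics:

	```sql
	DESCRIBE FORMATTED employee;
	DESCRIBE FORMATTED employee_id employee_id;
	```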

21. Hive logs

	Logs provide useful information to find out how a Hive query/job runs. By checking the
	Hive logs, we can identify runtime problems and issues that may cause bad performance.
	There are two types of logs available in Hive: the system log and the job log.
	The system log contains the Hive running status and issues. It is configured in
	{HIVE_HOME}/conf/hive-log4j.properties, where the following three Hive log settings can be found:
		hive.root.logger=WARN,DRFA
		hive.log.dir=/tmp/${user.name}
		hive.log.file=hive.log

	To modify the logging behavior, we can either modify the preceding lines in hive-log4j.properties
	(applies to all users) or set it from the Hive CLI (applies only to the current user and current session) as follows:
	hive --hiveconf hive.root.logger=DEBUG,console
	
	The job log contains Hive query information and is saved at the same place,
	/tmp/${user.name}, by default as one file for each Hive user session. We can override it
	in hive-site.xml with the hive.querylog.location property. If a Hive query generates
	MapReduce jobs, those logs can also be viewed through the Hadoop JobTracker Web UI.	

22. Hive: Loading Data

http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data

23. Efficient Hive queries

Data Layout (Partitions and Buckets)
Data Sampling (Bucket and Block sampling)
Data Processing (Bucket Map Join and Parallel execution)

https://www.qubole.com/blog/big-data/5-tips-for-efficient-hive-queries/

24. Apache Hive: The .hiverc file

What is the .hiverc file?
It is a file that is executed when you launch the Hive shell - making it an ideal place for adding any Hive configuration/customization you want set
on start of the Hive shell. This could be:
- Setting column headers to be visible in query results
- Making the current database name part of the hive prompt
- Adding any jars or files
- Registering UDFs

.hiverc file location
The file is loaded from the hive conf directory.
I have the CDH4.2 distribution and the location is: /etc/hive/conf.cloudera.hive1
If the file does not exist, you can create it.
It needs to be deployed to every node from where you might launch the Hive shell.
[Note: I had to create the file;  The distribution did not come with it.]

Sample .hiverc
add jar /home/airawat/hadoop-lib/hive-contrib-0.10.0-cdh4.2.0.jar;
set hive.exec.mode.local.auto=true;
set hive.cli.print.header=true;
set hive.cli.print.current.db=true;
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=30000000;

25. Hive on Spark Engines

	set spark.home=/usr/hdp/2.4.0.0-169/spark;
	SET hive.execution.engine=spark;
	set spark.master=yarn-client;   -- or: set spark.master=local; for local testing
	set spark.eventLog.enabled=true;
	set spark.executor.memory=512m;
	set spark.executor.cores=2;
	
	Add the spark-assembly jar to Hive's classpath as shown below:

	ADD JAR /usr/hdp/2.4.0.0-169/spark/lib/spark-assembly-1.6.0.2.4.0.0-169-hadoop2.7.1.2.4.0.0-169.jar;
	
	Troubleshooting:
	
	hive --hiveconf hive.root.logger=DEBUG,console
	

26. Hive Hook

	26.1 What is a hook? 
		Hooking is a general programming technique: 
		• Hooking - techniques for intercepting function calls, messages, or events in an operating system, application, or other software component. 
		• Hook - code that handles the intercepted function calls, events, or messages.
		
	26.2 Hive provides some hooking points 
		• pre-execution 
		• post-execution 
		• execution-failure 
		• pre- and post-driver-run 
		• pre- and post-semantic-analyze 
		• metastore-initialize	
		
	26.3 How to set up hooks in Hive 
		hive-site.xml
		
		<property> 
			<name>hive.exec.pre.hooks</name> 
			<value></value> 
			<description> Comma-separated list of pre-execution hooks to be invoked for each statement. A pre-execution hook is specified as the name of a Java class which implements the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface. </description> 
		</property> 
		
		<property>
			<name>hive.exec.post.hooks</name>
			<value>org.apache.hadoop.hive.ql.hooks.ATSHook, org.apache.atlas.hive.hook.HiveHook</value>
		</property>		
		
		<property>
			<name>hive.exec.failure.hooks</name>
			<value>org.apache.hadoop.hive.ql.hooks.ATSHook</value>
		</property>		
		
		<property> 
			<name>hive.aux.jars.path</name> 
			<value></value> 
		</property> 
		
		The hive.aux.jars.path property should point to the jars that contain the implementations of the hook interfaces or abstract classes. You can use hive.added.jars.path instead of hive.aux.jars.path.	
		
27. BEST PRACTICES IN APACHE HIVE		

	27.1 Use of Partitions

	Earlier, database experts used to create a table per day, that is, a complete new table for every single day. Though this practice needs lots of maintenance, people still use it in real-time scenarios. In Hive, however, we have the concept of partitions, which makes it easy to maintain a new partition for every new day. As we have already seen, for each partition Hive creates a new folder and stores the records accordingly, which helps target only the required area and avoid scanning the complete table.
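
	As a sketch (the table and column names are hypothetical), a daily-partitioned table and a query that touches only one partition's folder:

	```sql
	CREATE TABLE page_views (user_id BIGINT, url STRING)
	PARTITIONED BY (dt STRING);

	-- Only the dt='2014-01-01' folder is scanned, not the whole table
	SELECT count(*) FROM page_views WHERE dt = '2014-01-01';
	```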
	
	27.2 Over Partitioning

	Even though there are various benefits to creating partitions, as the saying goes, anything in excess is dangerous. When Hive creates a partition for a table, it has to maintain extra metadata to redirect queries to the right partition. If a table gets too many partitions, it becomes difficult for Hive to handle this metadata overhead. So it is very important to understand the data growth and the kind of data we are going to get, so that we can plan our schemas. It is also very important to select the correct columns for partitioning after completely understanding the kind of queries we want to run on that data, since a partitioning scheme that benefits some queries can badly affect the performance of others. As we know, HDFS works best with a smaller set of large files rather than a larger set of small files.
	
	27.3 Normalization

	Unlike other SQL engines, Hive has no primary keys or foreign keys, as Hive is not meant to run complex relational queries; it is used to get at data in an easy and efficient manner. So while designing a Hive schema, we don't need to bother about selecting a unique key, and we don't need to normalize the data set for efficiency.

	By keeping data denormalized, we avoid the multiple disk seeks that foreign-key relations generally require. Avoiding those extra I/O operations ultimately helps performance.
	
	27.4 Efficient Use of Single Scan
	
	As we all know, Hive does a complete table scan in order to process a query, so it is recommended to use that single scan to perform multiple operations. Take a look at the following queries:

	INSERT INTO page_views_20140101 
	SELECT * FROM page_views WHERE date='20140101';
	And 
	INSERT INTO page_views_20140102 
	SELECT * FROM page_views
	WHERE date='20140102';
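	
	The two statements above scan page_views twice. Hive's multi-insert syntax performs both inserts with a single scan of the source table:

	```sql
	FROM page_views
	INSERT INTO TABLE page_views_20140101
	  SELECT * WHERE date = '20140101'
	INSERT INTO TABLE page_views_20140102
	  SELECT * WHERE date = '20140102';
	```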
	
	27.5 Use of Bucketing

	Bucketing is an optimization technique similar to partitioning, but given the concerns of over-partitioning, we can instead go for system-defined data segregation. Buckets distribute the data load into a user-defined set of clusters by calculating the hash code of the key mentioned in the query. Bucketing is useful when it is difficult to create a partition on a column because it has a huge variety of values and we still want to run queries on it.

	One such example would be a page_views table where we want to run queries on user_id, but looking at the number of users, it would be difficult to create a separate partition for each and every user. So in this case we can create buckets on user_id.
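	
	Continuing the example (the column names are illustrative), the table can be declared with buckets on user_id; the bucket count of 32 is an arbitrary choice:

	```sql
	CREATE TABLE page_views (user_id BIGINT, url STRING, view_time TIMESTAMP)
	CLUSTERED BY (user_id) INTO 32 BUCKETS;

	-- Rows are assigned to one of the 32 bucket files by hashing user_id at load time
	```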
	
	
	
